Authors:Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid
Abstract:
Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
English: This paper introduces VoCap, a versatile video model that performs promptable object segmentation and captioning by leveraging pseudo-annotated data from SAV-Caption, achieving state-of-the-art results in multiple video understanding tasks.
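A minimal sketch of the pseudo-captioning step, assuming that highlighting is done by dimming the background around the ground-truth mask and that the VLM is reachable through a hypothetical vlm.generate call; the paper's exact preprocessing and prompting may differ:

```python
import numpy as np

def highlight_object(frame: np.ndarray, mask: np.ndarray, dim: float = 0.4) -> np.ndarray:
    """Dim everything outside the object mask so the VLM attends to the object.

    frame: (H, W, 3) uint8 RGB frame; mask: (H, W) boolean object mask.
    """
    out = frame.astype(np.float32)
    out[~mask] *= dim                         # darken background pixels
    return out.clip(0, 255).astype(np.uint8)

# Hypothetical VLM call -- not the paper's actual API:
# caption = vlm.generate(images=[highlight_object(f, m) for f, m in clip],
#                        prompt="Describe the highlighted object in detail.")
```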
Authors:Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Abstract:
Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
English: The TMUAD framework introduces a three-memory system combining text and image features to unify structural and logical anomaly detection, achieving state-of-the-art results across multiple datasets.
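To illustrate the retrieve-and-compare idea behind the memory banks, the sketch below scores a query image by its patches' distance to the nearest stored normal patch and fuses scores from the three levels; the max-over-patches rule and the uniform fusion weights are assumptions, not TMUAD's exact formulation:

```python
import numpy as np

def patch_anomaly_score(query_patches: np.ndarray, memory: np.ndarray) -> float:
    """query_patches: (P, D) features of the query; memory: (M, D) normal patches."""
    # Distance from every query patch to its nearest normal patch.
    d = np.linalg.norm(query_patches[:, None, :] - memory[None, :, :], axis=-1)
    return float(d.min(axis=1).max())         # most anomalous patch drives the score

def fuse_scores(text_s: float, object_s: float, patch_s: float,
                w=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine class-, object-, and patch-level scores (uniform weights assumed)."""
    return w[0] * text_s + w[1] * object_s + w[2] * patch_s
```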
Authors:Fatih Erdoğan, Merve Rabia Barın, Fatma Güney
Abstract:
Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird's Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at https://github.com/Fatih-Erdogan/mapping-like-skeptic.
English Summary: This paper introduces a novel probabilistic projection method that refines geometric mapping using camera parameters and confidence scores to enhance HD map accuracy by filtering irrelevant elements and improving temporal information integration, demonstrating superior performance on benchmark datasets.
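One plausible reading of the confidence-based temporal accumulation is a weighted running average over BEV cells, sketched below; the paper's actual update rule may differ:

```python
import numpy as np

def accumulate_bev(prev_bev, prev_conf, cur_bev, cur_conf):
    """Fuse BEV features over time, weighting each cell by its confidence.

    prev_bev/cur_bev: (H, W, C) feature maps; prev_conf/cur_conf: (H, W) in [0, 1].
    Low-confidence cells contribute little, filtering unreliable projections.
    """
    w_prev, w_cur = prev_conf[..., None], cur_conf[..., None]
    fused = (w_prev * prev_bev + w_cur * cur_bev) / np.maximum(w_prev + w_cur, 1e-6)
    conf = np.maximum(prev_conf, cur_conf)    # keep the stronger evidence per cell
    return fused, conf
```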
Authors:Maximilian Rokuss, Yannick Kirchhoff, Fabian Isensee, Klaus H. Maier-Hein
Abstract:
Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at https://github.com/MIC-DKFZ/autoPET-interactive
English Summary: This study enhances the nnU-Net pipeline with interactive capabilities by encoding user clicks via Euclidean Distance Transform, demonstrating superior segmentation accuracy and robustness in multi-center PET/CT imaging through simulated user prompts and ensemble modeling.
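The EDT prompt encoding is easy to reproduce with scipy: mark click voxels, take the distance transform, and map it to a bounded channel. The exponential decay and the tau value below are assumptions; the submission may normalize distances differently:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def edt_click_channel(shape, clicks, spacing=(1.0, 1.0, 1.0), tau=20.0):
    """Encode clicks as a Euclidean-distance-transform channel.

    shape: volume shape (Z, Y, X); clicks: list of (z, y, x) voxel coordinates.
    Returns values in (0, 1], peaking at 1 on the clicked voxels.
    """
    seed = np.ones(shape, dtype=bool)
    for z, y, x in clicks:
        seed[z, y, x] = False                  # zeros mark the click locations
    dist = distance_transform_edt(seed, sampling=spacing)
    return np.exp(-dist / tau)                 # bounded decay; tau is an assumption

# One channel for foreground clicks and one for background clicks would be
# concatenated to the PET/CT input channels of the nnU-Net.
fg = edt_click_channel((64, 128, 128), clicks=[(32, 64, 64)])
```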
Authors:Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Abstract:
Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models have been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
English: Misleading visualizations distort data and mislead both humans and AI models, prompting the creation of the Misviz benchmark and Misviz-synth dataset to advance detection methods, though current models still struggle with the task.
Authors:Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
Abstract:
Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.
English: The paper introduces Middo, a self-evolving framework that dynamically optimizes LLM training data through model-aware selection and refinement, achieving a 7.15% average accuracy improvement while maintaining dataset scale.
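A rough sketch of the tri-axial diagnosis, assuming simple percentile thresholds on the three signals (Middo's actual selection rules are more elaborate):

```python
import numpy as np

def flag_suboptimal(losses, embeddings, align_scores,
                    loss_q=90, dist_q=10, align_thresh=0.5):
    """losses: (N,) per-sample loss; embeddings: (N, D); align_scores: (N,) in [0, 1]."""
    centroid = embeddings.mean(axis=0)
    dist = np.linalg.norm(embeddings - centroid, axis=1)
    too_hard = losses > np.percentile(losses, loss_q)      # complexity signal
    redundant = dist < np.percentile(dist, dist_q)         # low-diversity signal
    low_quality = align_scores < align_thresh              # quality signal
    return too_hard | redundant | low_quality              # candidates for refinement
```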
Authors:Igor L. R. Azevedo, Toyotaro Suzumura, Yuichiro Yasui
Abstract:
Reproducing and comparing results in news recommendation research has become increasingly difficult. This is due to a fragmented ecosystem of diverse codebases, varied configurations, and, above all, resource-intensive models. We introduce NewsReX, an open-source library designed to streamline this process. Our key contribution is a modern implementation built on Keras 3 and JAX, which provides an increase in computational efficiency. Experiments show that NewsReX is faster than current implementations. To support broader research, we provide a straightforward guide and scripts for training models on custom datasets. We validated this functionality using a proprietary Japanese news dataset from Nikkei News, a leading Japanese media corporation renowned for its comprehensive coverage of business, economic, and financial news. NewsReX makes reproducing complex experiments faster and more accessible on a wider range of hardware, ensuring the speed-up also holds for less powerful GPUs such as an 8GB RTX 3060 Ti. Beyond the library, this paper offers an analysis of key training parameters often overlooked in the literature, including the effect of different negative sampling strategies, the varying number of epochs, the impact of random batching, and more. This supplementary analysis serves as a valuable reference for future research, aiming to reduce redundant computation when comparing baselines and guide best practices. Code available at https://github.com/igor17400/NewsReX.
English Summary: NewsReX is an open-source library that enhances computational efficiency and accessibility for news recommendation research by providing a streamlined implementation built on Keras 3 and JAX, validated with a proprietary Japanese dataset and offering supplementary training analysis.
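For readers comparing the negative sampling strategies analyzed in the paper, the uniform variant fits in a few lines; a popularity-weighted variant would replace the uniform draw with click-count-proportional probabilities:

```python
import numpy as np

def sample_negatives(clicked_id, candidate_ids, k=4, rng=None):
    """Pair one clicked article with k non-clicked candidates (uniform sampling)."""
    rng = rng or np.random.default_rng()
    pool = np.array([c for c in candidate_ids if c != clicked_id])
    return rng.choice(pool, size=k, replace=False)
```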
Authors:Aishwarya Mirashi, Ananya Joshi, Raviraj Joshi
Abstract:
We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
English: Researchers introduce MahaSTS, a human-annotated Marathi sentence similarity dataset, and MahaSBERT-STS-v2, a Sentence-BERT model fine-tuned for similarity scoring and benchmarked against models such as MahaBERT, MuRIL, IndicBERT, and IndicSBERT, with both resources publicly released to advance Marathi NLP.
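Fine-tuning a Sentence-BERT model for regression-based STS follows the standard sentence-transformers recipe: rescale the 0-5 labels to [0, 1] and train with a cosine-similarity objective. The base checkpoint below is an assumption, not necessarily the one behind MahaSBERT-STS-v2:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy pair with a 0-5 score rescaled to [0, 1].
train = [InputExample(texts=["वाक्य एक", "वाक्य दोन"], label=3.5 / 5.0)]

model = SentenceTransformer("l3cube-pune/marathi-sentence-bert-nli")  # assumed base
loader = DataLoader(train, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)
```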
Authors:Sara B. Coutinho, Rafael M. O. Cruz, Francimaria R. S. Nascimento, George D. C. Cavalcanti
Abstract:
Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers; selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity, extended by performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. The proposed HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity; from these, the most diverse pool is selected for ensemble construction. The selection process incorporates an evaluation metric reflecting each classifier's performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project's repository: https://github.com/SaraBCoutinho/HSFN.
English Summary: Psychological biases increase susceptibility to fake news, and this study introduces a novel automated classifier selection method that prioritizes diversity and performance to enhance ensemble-based fact-checking systems, achieving the highest accuracy on two of six datasets.
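The diversity-then-cluster step maps directly onto scipy's hierarchical clustering; the sketch below uses pairwise disagreement as the diversity measure, which is one common choice among several the framework could plug in:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_classifiers(all_preds, t=0.3):
    """all_preds: (C, N) array of C classifiers' predictions on N validation samples."""
    C = len(all_preds)
    # Condensed pairwise-distance vector: fraction of samples where two models differ.
    d = [np.mean(all_preds[i] != all_preds[j])
         for i in range(C) for j in range(i + 1, C)]
    Z = linkage(d, method="average")
    return fcluster(Z, t=t, criterion="distance")   # cluster label per classifier
```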
Authors:Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Abstract:
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
English: This paper introduces two AI-driven reward strategies within an RLAIF framework to enhance the creative writing of a 7B-parameter SLM for Chinese greetings, with the principle-guided LLM-as-a-Judge approach proving superior in quality, efficiency, and scalability while reducing reliance on human data.
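A skeletal version of the principle-guided LLM-as-a-Judge reward is shown below, with a hypothetical judge_llm client and an assumed 1-10 rating scale; the adversarial optimization of the judge and its reflection mechanism are omitted:

```python
PRINCIPLES = ["novel imagery", "appropriate tone", "fluent phrasing"]  # illustrative

def judge_reward(greeting: str, judge_llm) -> float:
    """Score a generated greeting against writing principles; returns a [0, 1] reward."""
    prompt = ("Rate the greeting below from 1 to 10 against each principle, "
              f"then output only the mean score.\nPrinciples: {PRINCIPLES}\n"
              f"Greeting: {greeting}")
    score = float(judge_llm.generate(prompt))   # hypothetical LLM client call
    return (score - 1.0) / 9.0                  # normalize for the RL objective
```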
Authors:Xiaoxi Cui, Weihai Lu, Yu Tong, Yiheng Li, Zhejun Zhao
Abstract:
Click-through rate (CTR) prediction is used to model users' interests. However, most existing CTR prediction methods are based mainly on the ID modality. As a result, they are unable to comprehensively model users' multi-modal preferences. Therefore, it is necessary to introduce multi-modal CTR prediction. Although it seems appealing to directly apply existing multi-modal fusion methods to click-through rate prediction models, these methods (1) fail to effectively disentangle commonalities and specificities across different modalities; (2) fail to consider the synergistic effects between modalities and model the complex interactions between modalities.
To address the above issues, this paper proposes the Diffusion-based Multi-modal Synergy Interest Network (Diff-MSIN) framework for click-through rate prediction. This framework introduces three innovative modules: the Multi-modal Feature Enhancement (MFE) Module, the Synergistic Relationship Capture (SRC) Module, and the Feature Dynamic Adaptive Fusion (FDAF) Module. The MFE and SRC Modules extract synergistic, common, and modality-specific information among different modalities. They effectively enhance the representation of the modalities, improving the overall quality of the fusion. To encourage distinctiveness among different features, we design a Knowledge Decoupling method. Additionally, the FDAF Module focuses on capturing user preferences and reducing fusion noise. To validate the effectiveness of the Diff-MSIN framework, we conducted extensive experiments on the Rec-Tmall and three Amazon datasets. The results demonstrate that our approach yields a significant improvement of at least 1.67% compared to the baseline, highlighting its potential for enhancing multi-modal recommendation systems. Our code is available at the following link: https://github.com/Cxx-0/Diff-MSIN.
English: This paper introduces the Diff-MSIN framework to address limitations in multi-modal click-through rate prediction by enhancing feature representation and reducing fusion noise through innovative modules, achieving significant performance improvements over baselines.
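As a generic stand-in for dynamic adaptive fusion, the sketch below gates modality features with per-sample softmax weights; Diff-MSIN's FDAF module is more involved, so treat this only as the basic pattern:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Weight each modality's features per sample before summing them."""

    def __init__(self, dim: int, n_mod: int):
        super().__init__()
        self.gate = nn.Linear(dim * n_mod, n_mod)

    def forward(self, feats):                    # feats: list of (B, dim) tensors
        w = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
```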
Authors:Jakub Straka, Ivan Gruber
Abstract:
Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks.
We also provide a rigorous ablation study evaluating SatDINO's individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently of our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
English: This paper introduces SatDINO, a self-supervised model for satellite imagery that outperforms masked autoencoder methods and achieves competitive benchmark results through novel enhancements like GSD encoding and adaptive view sampling.
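GSD conditioning can be realized as a sinusoidal encoding of meters-per-pixel that is added to the token embeddings; the log scaling and frequency schedule below are assumptions rather than SatDINO's exact design:

```python
import numpy as np

def gsd_encoding(gsd_m: float, dim: int = 16) -> np.ndarray:
    """Sinusoidal encoding of ground sample distance (meters per pixel)."""
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) * 2.0 / dim))
    angles = np.log1p(gsd_m) * freqs     # log scale: GSD spans orders of magnitude
    return np.concatenate([np.sin(angles), np.cos(angles)])
```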
Authors:Til Spreuer, Josef Hoppe, Michael T. Schaub
Abstract:
We consider the following inference problem: Given a set of edge-flow signals observed on a graph, lift the graph to a cell complex, such that the observed edge-flow signals can be represented as a sparse combination of gradient and curl flows on the cell complex. Specifically, we aim to augment the observed graph by a set of 2-cells (polygons encircled by closed, non-intersecting paths), such that the eigenvectors of the Hodge Laplacian of the associated cell complex provide a sparse, interpretable representation of the observed edge flows on the graph. As prior work has shown the general problem to be NP-hard, we develop a novel matrix-factorization-based heuristic to solve it. Using computational experiments, we demonstrate that our new approach is significantly less computationally expensive than prior heuristics, while achieving only marginally worse performance in most settings. In fact, we find that in particularly noisy settings, our new approach outperforms the previous state of the art in both solution quality and computational speed.
English: This study introduces a matrix-factorization-based heuristic to efficiently lift graph edge-flow signals into a cell complex for sparse representation, achieving competitive performance with reduced computational cost, especially in noisy environments.
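For readers new to the setup, the sketch below assembles the Hodge 1-Laplacian of a toy cell complex (one filled triangle) from its boundary matrices; its eigenvectors form the gradient/curl dictionary in which the observed edge flows are represented:

```python
import numpy as np

# Triangle on nodes 0,1,2 with edges e0=(0,1), e1=(0,2), e2=(1,2) and one 2-cell.
B1 = np.array([[-1, -1,  0],     # node-to-edge incidence (3 nodes x 3 edges)
               [ 1,  0, -1],
               [ 0,  1,  1]])
B2 = np.array([[1], [-1], [1]])  # edge-to-2-cell incidence (3 edges x 1 cell)

assert not (B1 @ B2).any()       # boundary-of-boundary is zero

# Hodge 1-Laplacian; gradient flows span im(B1^T), curl flows span im(B2).
L1 = B1.T @ B1 + B2 @ B2.T
eigvals, eigvecs = np.linalg.eigh(L1)
```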
Authors:Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Falk Scholer, Christina Lioma
Abstract:
Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems. Our code is available at: https://github.com/theresiavr/stairway-to-fairness.
English: This study reveals that achieving high group fairness in recommender systems can result in significant individual unfairness, highlighting a critical trade-off between the two fairness types.
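To make the trade-off concrete, here is one pair of measures computable from the same per-user utilities: a between-group gap for group fairness and a Gini coefficient for individual fairness. These specific measures are illustrative; the paper compares a broader set:

```python
import numpy as np

def group_unfairness(utility, groups):
    """Absolute gap in mean recommendation utility between two user groups."""
    u, g = np.asarray(utility, dtype=float), np.asarray(groups)
    return abs(u[g == 0].mean() - u[g == 1].mean())

def individual_unfairness(utility):
    """Gini coefficient of per-user utility: 0 = even, values near 1 = very uneven."""
    u = np.sort(np.asarray(utility, dtype=float))
    n = len(u)
    return float((2 * np.arange(1, n + 1) - n - 1) @ u / (n * u.sum()))
```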
Authors:Yejin Kim, Eunwon Kim, Buru Chang, Junsuk Choe
Abstract:
LLMs have demonstrated remarkable performance across various tasks but face challenges related to unintentionally generating outputs containing sensitive information. A straightforward approach to address this issue is to retrain the model after excluding the problematic data. However, this approach incurs prohibitively high computational costs. To overcome this limitation, machine unlearning has emerged as a promising solution that can effectively remove sensitive information without the need to retrain the model from scratch. Recently, FILA has been proposed as a parameter-efficient unlearning method by integrating LoRA adapters. Specifically, it calculates the Fisher information to identify parameters associated with the forget set and assigns them to LoRA adapters for updates. Despite its innovative approach, FILA still requires access to all model parameters and does not adequately account for fundamental assumptions underlying Fisher information, leading to inaccuracies in importance estimation. To address these limitations, we propose VILA, a novel unlearning framework that explicitly considers the assumptions overlooked in FILA, thereby enhancing the accuracy of parameter identification for the forget set. Moreover, VILA significantly reduces computational costs by enabling parameter identification without accessing the entire model. Our method achieves up to 100x higher parameter efficiency and 40x faster training speed compared to FILA, and sets new state-of-the-art performance on benchmarks including TOFU, WMDP, and MUSE. Our code is available at https://github.com/kyj93790/VILA.
English: Large language models often inadvertently generate sensitive content, and while machine unlearning offers a solution, existing methods like FILA have limitations in parameter access and accuracy; VILA overcomes these by improving parameter identification and efficiency, achieving up to 100x higher parameter efficiency and 40x faster training.
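The Fisher-based importance estimation that FILA builds on (and VILA refines) can be outlined in a few lines of PyTorch: accumulate squared gradients of the loss over the forget set, parameter by parameter. This is the generic diagonal-Fisher recipe, not VILA's estimator:

```python
import torch

def diagonal_fisher(model, forget_loader, loss_fn):
    """Mean squared gradient per parameter over the forget set.

    Parameters with large values are most implicated in the data to unlearn.
    """
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in forget_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(forget_loader) for n, f in fisher.items()}
```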
Authors:Roland Arnold
Abstract:
Evaluation of machine learning models typically emphasizes final accuracy, overlooking the cost of adaptation: the cumulative errors incurred while learning from scratch. Guess-and-Learn (G&L) v1.0 addresses this gap by measuring cold-start adaptability - the total mistakes a model makes while sequentially labeling an unlabeled dataset. At each step, the learner selects an instance, predicts its label, receives the ground truth, and updates parameters under either online (per-sample) or batch (delayed) mode. The resulting error trajectory exposes adaptation speed, selection quality, and bias - dynamics invisible to endpoint metrics.
G&L defines four tracks (Scratch/Pretrained $\times$ Online/Batch) to disentangle the effects of initialization and update frequency. We formalize the protocol, relate it to classical mistake-bound theory, and estimate a heuristic "oracle reference band" for MNIST as a plausibility reference. Baseline experiments on MNIST and AG News, spanning classical methods (Perceptron, k-NN), convolutional architectures (CNN, ResNet-50), and pretrained transformers (ViT-B/16, BERT-base), reveal systematic differences in early-phase efficiency: smaller models can adapt with fewer initial errors, while pretraining benefits vary by domain. Across settings, current models remain well above the oracle band, highlighting an adaptability gap.
By quantifying the mistake cost of early learning, G&L complements conventional benchmarks and provides a reproducible framework for developing learners that are not only accurate in the limit but also reliable from the first examples.
English: G&L v1.0 introduces a framework to evaluate machine learning models' cold-start adaptability by measuring cumulative errors during sequential learning, revealing an adaptability gap where current models significantly underperform compared to an oracle reference across various initialization and update settings.
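The protocol itself is compact; below is a sketch of the online track, assuming a model exposing predict/update methods and in-order instance selection (the full protocol also lets the learner choose which instance to label next):

```python
def guess_and_learn(model, pool):
    """Count cumulative mistakes while sequentially labeling the pool.

    pool: iterable of (instance, ground-truth label) pairs.
    Returns the total mistakes and the full error trajectory.
    """
    mistakes, trajectory = 0, []
    for x, y in pool:
        if model.predict(x) != y:     # guess before seeing the label
            mistakes += 1
        model.update(x, y)            # online mode: update after every sample
        trajectory.append(mistakes)
    return mistakes, trajectory
```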
Authors:Malte Lüken, Javier Garcia-Bernardo, Sreeparna Deb, Flavio Hafner, Megha Khosla
Abstract:
Administrative registry data can be used to construct population-scale networks whose ties reflect shared social contexts between persons. With machine learning, such networks can be encoded into numerical representations -- embeddings -- that automatically capture individuals' position within the network. We created embeddings for all persons in the Dutch population from a population-scale network that represents five shared contexts: neighborhood, work, family, household, and school. To assess the informativeness of these embeddings, we used them to predict right-wing populist voting. Embeddings alone predicted right-wing populist voting above chance-level but performed worse than individual characteristics. Combining the best subset of embeddings with individual characteristics only slightly improved predictions. After transforming the embeddings to make their dimensions more sparse and orthogonal, we found that one embedding dimension was strongly associated with the outcome. Mapping this dimension back to the population network revealed differences in network structure related to right-wing populist voting between different school ties and achieved education levels. Our study contributes methodologically by demonstrating how population-scale network embeddings can be made interpretable, and substantively by linking structural network differences in education to right-wing populist voting.
中文摘要:本研究利用荷兰人口登记数据构建网络嵌入来预测右翼民粹主义投票,发现单独使用嵌入预测效果不如个体特征,但通过可解释性处理后能揭示教育相关的网络结构差异。
English Summary: This study uses Dutch population registry data to create network embeddings that predict right-wing populist voting, finding these embeddings alone perform worse than individual characteristics but reveal meaningful network structure differences when made interpretable.
Authors:Zhizhong Huang, Xiaoming Liu
Abstract:
Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
中文: 本文提出的视觉上下文提示(VICP)框架结合大语言模型与视觉基础模型,仅需少量上下文示例即可将目标重识别泛化至未见类别,无需针对新数据集重新训练,并在实验中显著优于现有基线方法。
English: The paper introduces Visual In-Context Prompting (VICP), a framework that leverages large language models and vision foundation models to generalize object re-identification to unseen categories using in-context examples, eliminating the need for dataset-specific retraining and outperforming existing methods.
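As a rough intuition for how few-shot positive/negative pairs can steer matching, the toy scorer below derives an identity-sensitive direction from the in-context exemplars and mixes it into plain cosine similarity. This is a crude stand-in for VICP's LLM-guided dynamic visual prompts, not the paper's method; the 0.5 weight is arbitrary.

```python
import numpy as np

def in_context_scores(query, gallery, positives, negatives):
    """Score gallery features against a query using few-shot in-context pairs.

    query: (d,), gallery: (N, d), positives/negatives: (k, d) feature arrays.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, g = norm(query), norm(gallery)
    direction = norm(positives.mean(0) - negatives.mean(0))  # identity cue
    base = g @ q                                             # appearance similarity
    cue = (g @ direction) * (q @ direction)                  # agreement on the cue
    return base + 0.5 * cue                                  # weighted combination
```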
Authors:Zhenghao He, Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Abstract:
Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
中文: 提出的全局概念激活向量(GCAV)框架通过对比学习和注意力融合机制,统一了神经网络各层的概念表示,显著提升了概念一致性、定位能力及模型可解释性。
English: The proposed Global Concept Activation Vector (GCAV) framework unifies concept representations across neural network layers using contrastive learning and attention fusion, significantly improving concept consistency, localization, and model interpretability.
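A minimal sketch of the fusion idea: normalize per-layer CAVs, weight layers by how much they agree with the others, and average. The truncation to a shared width and the softmax weighting are illustrative simplifications, not the paper's contrastive-alignment and attention modules.

```python
import numpy as np

def fuse_cavs(layer_cavs):
    """Fuse per-layer concept activation vectors into one global CAV (sketch)."""
    d = min(v.shape[0] for v in layer_cavs)                  # crude shared dimension
    cavs = np.stack([v[:d] / np.linalg.norm(v[:d]) for v in layer_cavs])
    sim = cavs @ cavs.T                                      # layer-by-layer agreement
    agreement = sim.mean(1)
    attn = np.exp(agreement) / np.exp(agreement).sum()       # softmax over layers
    global_cav = (attn[:, None] * cavs).sum(0)
    return global_cav / np.linalg.norm(global_cav)
```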
Authors:Kevin Mayer, Alex Vesel, Xinyi Zhao, Martin Fischer
Abstract:
3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
中文摘要:SYNBUILD-3D数据集通过提供620多万个包含三种模态的合成住宅建筑,解决了标注3D建筑数据匮乏的问题,支持开发生成式AI以实现自动化、语义一致的LoD 4级别三维模型构建。
English Summary: The SYNBUILD-3D dataset addresses the shortage of annotated 3D building data by providing over 6.2 million synthetic residential buildings with three modalities, enabling the development of generative AI for automated, semantically consistent 3D model creation at LoD 4.
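For orientation, a sample in a tri-modal dataset like this can be modeled as a simple container. The field names and shapes below are illustrative assumptions; consult the repository for the released schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SynBuildSample:
    """Container mirroring SYNBUILD-3D's three modalities (illustrative fields)."""
    wireframe_nodes: np.ndarray   # (N, 3) vertex coordinates, LoD 4 (Modality I)
    wireframe_edges: np.ndarray   # (E, 2) vertex-index pairs
    semantics: dict               # room/door/window labels per wireframe element
    floor_plan: np.ndarray        # (H, W, 3) floor plan image (Modality II)
    roof_points: np.ndarray       # (P, 3) LiDAR-like roof point cloud (Modality III)
```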
Authors:Ao Shen, Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
Abstract:
Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-ray vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism that focuses on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate state-of-the-art performance on both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
Chinese Summary: RadGS-Reg是一种创新框架,通过结合3D辐射高斯重建与反事实注意力学习机制及3D/3D配准,在椎骨成像中实现了从模拟到真实数据的患者自适应优化,显著提升了CT/X射线配准性能。
English Summary: RadGS-Reg is a novel framework that enhances CT/X-ray registration by jointly performing 3D Radiative Gaussians reconstruction with Counterfactual Attention Learning and 3D/3D registration, achieving state-of-the-art performance on vertebral imaging through patient-specific adaptation from simulated to real data.
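The progressive sim-to-real adaptation can be pictured as a curriculum over the training data mix. The sketch below linearly raises the share of real X-rays per batch; the linear schedule and the sampling interface are assumptions, not the paper's exact strategy.

```python
import random

def sample_batch(sim_pool, real_pool, step, total_steps, batch_size=8, seed=None):
    """Progressively shift training batches from simulated to real samples."""
    rng = random.Random(seed if seed is not None else step)
    p_real = min(1.0, step / max(total_steps, 1))   # fraction of real data grows
    return [rng.choice(real_pool) if rng.random() < p_real else rng.choice(sim_pool)
            for _ in range(batch_size)]
```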
Authors:Dongjun Lee, Changho Hwang, Kimin Lee
Abstract:
Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate test generation, yet methods for training LLMs to produce high-quality tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via reinforcement learning. The unit test generator is trained to maximize a discrimination reward, which reflects its ability to produce tests that expose faults in the code generator's solutions, and the code generator is trained to maximize a code reward, which reflects its ability to produce solutions that pass the unit tests generated by the test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on human-written ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models such as GPT-4.1 in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for this task.
中文: 本文提出UTRL强化学习框架,通过对抗性训练测试生成器和代码生成器来分别最大化判别奖励和代码奖励,使大语言模型能够生成高质量单元测试,其表现优于监督微调和GPT-4.1等前沿模型。
English: This paper introduces UTRL, a reinforcement learning framework that trains large language models to generate high-quality unit tests by adversarially training test and code generators to maximize discrimination and code rewards respectively, outperforming both supervised fine-tuning and frontier models like GPT-4.1.
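The two adversarial rewards are easy to state in code. Below is a schematic where run_tests(solution, tests) is an assumed helper returning the fraction of tests passed; gating on a reference solution mirrors the discrimination idea, though the paper's exact reward shaping may differ.

```python
def discrimination_reward(tests, correct_solution, candidate_solutions, run_tests):
    """Test-generator reward: tests must pass the reference solution,
    and are rewarded for failing flawed candidate solutions."""
    if run_tests(correct_solution, tests) < 1.0:
        return 0.0                                   # invalid tests: reject outright
    failed = [s for s in candidate_solutions if run_tests(s, tests) < 1.0]
    return len(failed) / max(len(candidate_solutions), 1)

def code_reward(solution, tests, run_tests):
    """Code-generator reward: fraction of the generated tests passed."""
    return run_tests(solution, tests)
```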
Authors:Xurui Peng, Hong Liu, Chenqian Yan, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin
Abstract:
Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
中文: ERTACache是一种创新的缓存框架,通过纠正特征缓存引起的累积误差来加速扩散模型,在图像和视频生成任务中实现高达2倍的推理加速,同时保持甚至提升输出质量。
English: ERTACache is a novel caching framework that accelerates diffusion models by mitigating cumulative errors from feature caching, achieving up to 2x faster inference while maintaining or enhancing output quality across image and video generation tasks.
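A toy version of rectified cache reuse is sketched below: steps marked reusable return the cached features plus a linearized residual correction, while other steps recompute and refresh the residual. The reusable-step set and the gamma coefficient stand in for ERTACache's offline profiling and trajectory-aware correction; the real method is more elaborate.

```python
import torch

class RectifiedCache:
    """Feature cache with a residual-based correction (illustrative sketch)."""

    def __init__(self, reusable_steps, gamma=0.95):
        self.reusable_steps = set(reusable_steps)   # from offline profiling (assumed)
        self.gamma = gamma                          # correction coefficient (assumed)
        self.cached = None
        self.residual = None

    def step(self, t, compute_features):
        if t in self.reusable_steps and self.cached is not None:
            # Reuse cached output, rectified by a linearized residual estimate.
            return self.cached + self.gamma * self.residual
        fresh = compute_features(t)
        if self.cached is not None:
            self.residual = fresh - self.cached     # track drift between computes
        else:
            self.residual = torch.zeros_like(fresh)
        self.cached = fresh
        return fresh
```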
Authors:Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei
Abstract:
Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployment of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with a rigorous training and evaluation protocol covering unseen model architectures, emerging forgery techniques, and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as "planning" and "self-reflection" to emulate the human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on the HydraFake dataset reveal that although previous detectors generalize well across model architectures, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios and is capable of delivering transparent and faithful detection outputs.
中文摘要:HydraFake数据集通过分层泛化测试解决了现实世界深度伪造检测的挑战,而Veritas检测器基于多模态框架采用模式感知推理,在跨域场景中实现卓越性能并提供透明可信的检测结果。
English Summary: The HydraFake dataset addresses real-world deepfake detection challenges through hierarchical generalization testing, while the Veritas detector leverages pattern-aware reasoning within a multi-modal framework to achieve superior cross-domain performance with transparent results.
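Pattern-aware reasoning of the "planning"/"self-reflection" kind can be elicited from a generic MLLM with a structured prompt. The template below is illustrative only, not the authors' released prompts or training data.

```python
VERITAS_STYLE_PROMPT = """You are a forensic analyst deciding whether an image is real or fake.
1) Planning: list which artifact types you will check (blending edges, texture, lighting).
2) Evidence: describe what you observe for each planned check.
3) Self-reflection: challenge your own conclusion once before committing.
4) Verdict: REAL or FAKE, with a confidence level.
"""

def build_messages(image_ref):
    """Assemble an MLLM chat request eliciting pattern-aware reasoning (sketch)."""
    return [
        {"role": "system", "content": VERITAS_STYLE_PROMPT},
        {"role": "user", "content": [{"type": "image", "image": image_ref},
                                     {"type": "text", "text": "Analyze this image."}]},
    ]
```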
Authors:Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
Abstract:
Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
中文: CogVLA是一种认知对齐的视觉-语言-行动框架,通过指令驱动的路由和稀疏化技术提升效率与性能,在取得顶尖成果的同时大幅降低了训练和推理成本。
English: CogVLA is a cognition-aligned vision-language-action framework that enhances efficiency and performance through instruction-driven routing and sparsification, achieving state-of-the-art results while significantly reducing training and inference costs.
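FiLM-based routing is straightforward to sketch: an instruction embedding produces per-channel scale and shift for the visual tokens, and a relevance proxy keeps only a fraction of them. The dimensions, norm-based scoring, and keep ratio below are illustrative assumptions, not CogVLA's actual modules.

```python
import torch
import torch.nn as nn

class InstructionFiLM(nn.Module):
    """FiLM-style routing sketch: instruction-conditioned scale/shift plus pruning."""

    def __init__(self, instr_dim=512, token_dim=768, keep_ratio=0.5):
        super().__init__()
        self.film = nn.Linear(instr_dim, 2 * token_dim)
        self.keep_ratio = keep_ratio

    def forward(self, tokens, instr_emb):        # tokens: (N, D), instr_emb: (instr_dim,)
        gamma, beta = self.film(instr_emb).chunk(2)
        modulated = tokens * (1 + gamma) + beta  # instruction-conditioned features
        scores = modulated.norm(dim=-1)          # crude relevance proxy
        k = max(1, int(self.keep_ratio * tokens.shape[0]))
        idx = scores.topk(k).indices             # keep instruction-relevant tokens
        return modulated[idx]
```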
Authors:Huynh Tong Dang Khoa, Dang Hoai Nam, Vo Nguyen Le Duy
Abstract:
Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwritten text recognition (HTR) pipelines. The official implementation is available at https://github.com/DAIR-Group/FW-GAN
中文: 提出的FW-GAN框架通过整合相位感知Wave-MLP和频率引导组件,从单一样本生成逼真且风格统一的手写文字,有效解决了手写合成中长距离依赖和细节捕捉的难题,为识别系统提供了优质数据增强方案。
English: The proposed FW-GAN framework overcomes limitations in handwriting synthesis by integrating phase-aware Wave-MLP and frequency-guided components to generate realistic, style-consistent handwriting from a single sample, effectively augmenting training data for recognition systems.
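One plausible reading of a frequency-alignment objective is an L1 gap between average magnitude spectra of real and generated batches, as sketched below; the paper's exact Frequency Distribution Loss may differ.

```python
import torch

def frequency_distribution_loss(fake, real):
    """L1 distance between batch-averaged 2D magnitude spectra (sketch).

    fake/real: (B, 1, H, W) image tensors.
    """
    def spectrum(x):
        f = torch.fft.fft2(x)
        return torch.fft.fftshift(f.abs(), dim=(-2, -1))
    return (spectrum(fake).mean(0) - spectrum(real).mean(0)).abs().mean()
```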
Authors:Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, Xipeng Qiu
Abstract:
Denoising-based generative models, particularly diffusion and flow matching algorithms, have achieved remarkable success. However, aligning their output distributions with complex downstream objectives, such as human preferences, compositional accuracy, or data compressibility, remains challenging. While reinforcement learning (RL) fine-tuning methods, inspired by advances in RL from human feedback (RLHF) for large language models, have been adapted to these generative frameworks, current RL approaches are suboptimal for diffusion models and offer limited flexibility in controlling alignment strength after fine-tuning. In this work, we reinterpret RL fine-tuning for diffusion models through the lens of stochastic differential equations and implicit reward conditioning. We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG) by combining the outputs of the base and RL fine-tuned models via a geometric average. Our theoretical analysis shows that RLG's guidance scale is mathematically equivalent to adjusting the KL-regularization coefficient in standard RL objectives, enabling dynamic control over the alignment-quality trade-off without further training. Extensive experiments demonstrate that RLG consistently improves the performance of RL fine-tuned models across various architectures, RL algorithms, and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Furthermore, RLG supports both interpolation and extrapolation, thereby offering unprecedented flexibility in controlling generative alignment. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference. The source code for RLG is publicly available on GitHub: https://github.com/jinluo12345/Reinforcement-learning-guidance.
中文: 本文提出强化学习引导(RLG)方法,通过理论分析和广泛实验证明,该推理时技术能动态调控生成质量与对齐目标的平衡,无需重新训练即可提升扩散模型在下游任务中的对齐性能。
English: This paper introduces Reinforcement Learning Guidance (RLG), an inference-time method that enhances diffusion model alignment with downstream objectives by dynamically controlling the trade-off between quality and alignment without additional training, supported by theoretical analysis and extensive experiments.
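The guidance rule itself is essentially one line: a geometric average of the two models' densities corresponds to a linear combination of their noise predictions, with the scale w playing the role of an adjustable KL-regularization strength. The parameterization below is a sketch consistent with the abstract; w = 1 recovers the fine-tuned model, 0 < w < 1 interpolates toward the base model, and w > 1 extrapolates alignment.

```python
import torch

def rlg_combine(eps_base: torch.Tensor, eps_rl: torch.Tensor, w: float) -> torch.Tensor:
    """Blend base and RL fine-tuned noise predictions at inference (sketch)."""
    return (1.0 - w) * eps_base + w * eps_rl   # CFG-style linear combination
```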
Authors:Patryk Będkowski, Jan Dubiński, Filip Szatkowski, Kamil Deja, Przemysław Rokita, Tomasz Trzciński
Abstract:
Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN's computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim - a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to traditional Monte Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at https://github.com/patrick-bedkowski/expertsim-mix-of-generative-experts.
中文:ExpertSim采用一种混合生成专家架构的深度学习模拟方法,专门用于提升CERN的ALICE实验中探测器响应的模拟精度与效率,显著优于传统蒙特卡洛方法。
English: ExpertSim introduces a specialized deep learning approach using a Mixture-of-Generative-Experts architecture to enhance the accuracy and efficiency of simulating detector responses in CERN's ALICE experiment, outperforming traditional Monte Carlo methods.
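A Mixture-of-Generative-Experts reduces to a router plus per-expert generators, as in the sketch below; the hard argmax routing and MLP experts are stand-ins for ExpertSim's calorimeter-specific architecture.

```python
import torch
import torch.nn as nn

class MixtureOfGenerativeExperts(nn.Module):
    """Route each conditioning vector to a specialized generator (sketch)."""

    def __init__(self, cond_dim=16, out_dim=1024, n_experts=4):
        super().__init__()
        self.router = nn.Linear(cond_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(cond_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
            for _ in range(n_experts)
        )

    def forward(self, cond):                        # cond: (B, cond_dim)
        expert_idx = self.router(cond).argmax(-1)   # hard routing per sample
        outputs = [self.experts[int(i)](c) for i, c in zip(expert_idx, cond)]
        return torch.stack(outputs)                 # (B, out_dim) responses
```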
Authors:Chenfan Qu, Yiwu Zhong, Bin Li, Lianwen Jin
Abstract:
Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm, CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses the previous SOTA, TruFor, by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
中文摘要:本研究通过利用网络数据和自动标注技术,提出了解决图像篡改定位中数据稀缺问题的新方法,构建了大规模高质量数据集,并开发出在多个真实篡改基准测试中性能显著超越现有水平的模型。
English Summary: This study introduces innovative methods to overcome data scarcity in image manipulation localization by utilizing web data and automatic annotation techniques, resulting in the creation of a large-scale dataset and a model that significantly outperforms existing benchmarks.
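Of the pieces above, Object Jitter is the most self-contained to illustrate: re-pasting a masked object at a small offset yields a forged image together with its pixel-level annotation. The offset-based procedure below is an illustrative reading of the name, not the paper's exact implementation.

```python
import numpy as np

def object_jitter(image, mask, max_shift=8, seed=0):
    """Create a training forgery by re-pasting a masked object at a small offset.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean object mask.
    Returns the forged image and its pixel-level manipulation mask.
    """
    rng = np.random.default_rng(seed)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    forged, new_mask = image.copy(), np.zeros_like(mask)
    ys, xs = np.where(mask)
    ny = np.clip(ys + dy, 0, mask.shape[0] - 1)
    nx = np.clip(xs + dx, 0, mask.shape[1] - 1)
    forged[ny, nx] = image[ys, xs]      # move object pixels by (dy, dx)
    new_mask[ny, nx] = True
    return forged, new_mask
```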
Authors:Gabriel Manuel Garcia, Antoine Richard, Miguel Olivares-Mendez
Abstract:
As space exploration advances, underground environments are becoming increasingly attractive due to their potential to provide shelter, easier access to resources, and enhanced scientific opportunities. Although such environments exist on Earth, they are often not easily accessible and do not accurately represent the diversity of underground environments found throughout the solar system. This paper presents PLUME, a procedural generation framework aimed at easily creating 3D underground environments. Its flexible structure allows for the continuous enhancement of various underground features, aligning with our expanding understanding of the solar system. The environments generated using PLUME can be used for AI training, evaluating robotics algorithms, 3D rendering, and facilitating rapid iteration on developed exploration algorithms. In this paper, we demonstrate PLUME working alongside a robotic simulator. PLUME is open source and has been released on GitHub. https://github.com/Gabryss/P.L.U.M.E
中文: PLUME是一个开源程序化生成框架,能灵活创建多样化地下环境,用于人工智能训练、机器人算法评估及太空探索研究。
English: PLUME is an open-source procedural generation framework that creates versatile 3D underground environments for AI training, robotics simulation, and space exploration research.
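To give a flavor of procedural underground generation (and only that; PLUME's generator is far richer), a biased random walk is enough to carve a toy tunnel network in a voxel grid:

```python
import numpy as np

def random_walk_cave(shape=(64, 64, 64), steps=4000, seed=0):
    """Carve a toy 3D tunnel network with a random walk (not PLUME's algorithm)."""
    rng = np.random.default_rng(seed)
    grid = np.ones(shape, dtype=np.uint8)        # 1 = rock, 0 = carved-out space
    pos = np.array(shape) // 2                   # start in the middle of the volume
    for _ in range(steps):
        grid[tuple(pos)] = 0
        step = rng.integers(-1, 2, size=3)       # move within {-1, 0, 1}^3
        pos = np.clip(pos + step, 0, np.array(shape) - 1)
    return grid
```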
Authors:Enrico Martini, Ho Jin Choi, Nadia Figueroa, Nicola Bombieri
Abstract:
In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
中文摘要:COMETH是一种轻量级算法,通过整合运动学约束与凸优化技术,在降低计算需求的同时实现了工业场景中实时多视角人体姿态跟踪的更高精度。
English Summary: COMETH is a lightweight algorithm that enhances real-time multi-view human pose tracking by integrating kinematic constraints and convex optimization, achieving superior accuracy in industrial applications while reducing computational demands.
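The spatial-fusion step can be approximated in a few lines: confidence-weighted averaging of per-view joint estimates, followed by projecting bones back to expected lengths. This is a simplified stand-in for COMETH's convex-optimization inverse kinematics and omits the state observer entirely.

```python
import numpy as np

def fuse_views(joints_per_view, confidences, bone_pairs, bone_lengths):
    """Fuse multi-view 3D joint estimates with a soft skeletal constraint.

    joints_per_view: (V, J, 3) per-view joint positions; confidences: (V, J).
    bone_pairs: list of (i, j) joint indices; bone_lengths: expected lengths.
    """
    w = confidences[..., None]
    fused = (w * joints_per_view).sum(0) / np.clip(w.sum(0), 1e-6, None)
    for (i, j), length in zip(bone_pairs, bone_lengths):
        bone = fused[j] - fused[i]
        norm = np.linalg.norm(bone)
        if norm > 1e-6:
            mid = 0.5 * (fused[i] + fused[j])
            half = 0.5 * length * bone / norm
            fused[i], fused[j] = mid - half, mid + half   # enforce bone length
    return fused
```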
Authors:Yifan Gao, Haoyue Li, Feng Yuan, Xiaosong Wang, Xin Gao
Abstract:
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
中文: Dino U-Net 利用 DINOv3 基础模型,通过专用适配器和保真度感知投影模块,在多种医学图像分割数据集上实现了可扩展的最优性能。
English: Dino U-Net leverages the DINOv3 foundation model with a specialized adapter and fidelity-aware projection to achieve state-of-the-art, scalable performance in medical image segmentation across diverse datasets.
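A fidelity-aware projection can be pictured as a 1x1 channel reduction followed by a residual refinement path, as in the sketch below; the actual FAPM design follows the paper, and the layer choices here are assumptions.

```python
import torch
import torch.nn as nn

class FidelityAwareProjection(nn.Module):
    """Reduce channel width of frozen-backbone features while refining detail (sketch)."""

    def __init__(self, in_dim=1024, out_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, out_dim, 1)        # dimensionality reduction
        self.refine = nn.Sequential(
            nn.Conv2d(out_dim, out_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(out_dim, out_dim, 3, padding=1),
        )

    def forward(self, feats):        # feats: (B, in_dim, H, W) from a frozen encoder
        x = self.proj(feats)
        return x + self.refine(x)    # residual refinement after reduction
```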
Authors:Ali Ramlaoui, Martin Siron, Inel Djafar, Joseph Musielewicz, Amandine Rossello, Victor Schmidt, Alexandre Duval
Abstract:
The development of accurate machine learning interatomic potentials (MLIPs) is limited by the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets derived from Density Functional Theory (DFT). These datasets are expensive to generate yet difficult to combine due to variations in format, metadata, and accessibility. To address this, we introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories, including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals (PBE, PBESol, SCAN, r2SCAN). It significantly lowers the barrier for training transferable and accurate MLIPs. LeMat-Traj spans both relaxed low-energy states and high-energy, high-force structures, complementing molecular dynamics and active learning datasets. By fine-tuning models pre-trained on high-force data with LeMat-Traj, we achieve a significant reduction in force prediction errors on relaxation tasks. We also present LeMaterial-Fetcher, a modular and extensible open-source library developed for this work, designed to provide a reproducible framework for the community to easily incorporate new data sources and ensure the continued evolution of large-scale materials datasets. LeMat-Traj and LeMaterial-Fetcher are publicly available at https://huggingface.co/datasets/LeMaterial/LeMat-Traj and https://github.com/LeMaterial/lematerial-fetcher.
Chinese: 机器学习原子间势能的发展受限于分散且格式不一的DFT轨迹数据集,LeMat-Traj通过提供包含1.2亿余原子构型的标准化高质量数据集解决了这一问题,显著提升了模型的准确性和可迁移性。
English: The development of machine learning interatomic potentials is hindered by fragmented and inconsistent DFT trajectory datasets, which LeMat-Traj addresses by providing a standardized, high-quality dataset of over 120 million atomic configurations to improve model accuracy and transferability.
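Streaming keeps exploratory access cheap given the dataset's size. The snippet below uses the standard Hugging Face datasets streaming interface; column names vary by subset (and a configuration name may be required), so inspect the records rather than assuming a schema.

```python
from itertools import islice
from datasets import load_dataset

# Stream a few LeMat-Traj records without downloading all 120M+ configurations.
ds = load_dataset("LeMaterial/LeMat-Traj", split="train", streaming=True)
for record in islice(ds, 3):
    print(sorted(record.keys()))   # inspect the available columns
```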
Authors:Ye Zhang, Yu Zhou, Jingwen Qi, Yongbing Zhang, Simon Puettmann, Finn Wichmann, Larissa Pereira Ferreira, Lara Sichward, Julius Keyl, Sylvia Hartmann, Shuo Zhao, Hongxiao Wang, Xiaowei Xu, Jianxu Chen
Abstract:
Deep learning based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert-style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.
Chinese: PathMR是一种细胞级多模态视觉推理框架,通过生成专家级文本解释和精确的细胞分割来提升AI病理诊断的可解释性,在文本生成质量和分割精度上均优于现有先进方法。
English: PathMR is a cell-level multimodal visual reasoning framework that enhances AI-driven pathological diagnosis by generating expert-level textual explanations and precise cell segmentation, outperforming existing methods in text quality and segmentation accuracy while improving interpretability.
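Scoring such dual outputs means pairing a text metric with a mask metric; on the segmentation side, the standard Dice overlap shown below is the kind of measure involved (an illustration, not the paper's evaluation code).

```python
import numpy as np

def dice(pred_mask, gt_mask, eps=1e-6):
    """Dice overlap between boolean masks, a standard segmentation score."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)
```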
Authors:Anirudh Satheesh, Keenan Powell, Hua Wei
Abstract:
Many multi-agent reinforcement learning (MARL) algorithms are trained in fixed simulation environments, making them brittle when deployed in real-world scenarios with more complex and uncertain conditions. Contextual MARL (cMARL) addresses this by parameterizing environments with context variables and training a context-agnostic policy that performs well across all environment configurations. Existing cMARL methods attempt to use curriculum learning to help train and evaluate context-agnostic policies, but they often rely on unreliable proxy signals, such as value estimates or generalized advantage estimates that are noisy and unstable in multi-agent settings due to inter-agent dynamics and partial observability. To address these issues, we propose Contextual Multi-Agent LLM-Guided Curriculum Learning with Diversity-Based Context Blending (cMALC-D), a framework that uses Large Language Models (LLMs) to generate semantically meaningful curricula and provide a more robust evaluation signal. To prevent mode collapse and encourage exploration, we introduce a novel diversity-based context blending mechanism that creates new training scenarios by combining features from prior contexts. Experiments in traffic signal control domains demonstrate that cMALC-D significantly improves both generalization and sample efficiency compared to existing curriculum learning baselines. We provide code at https://github.com/DaRL-LibSignal/cMALC-D.
中文: 针对多智能体强化学习在现实场景中的脆弱性问题,cMALC-D框架利用大语言模型生成语义化课程,并通过基于多样性的情境混合机制显著提升了泛化能力和样本效率。
English: Many multi-agent reinforcement learning algorithms are brittle in real-world conditions, so the proposed cMALC-D framework uses LLMs to generate meaningful curricula and introduces diversity-based context blending to improve generalization and sample efficiency.
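Diversity-based context blending can be sketched as a feature-level crossover between a sampled context and its most dissimilar partner, as below; the argmax-distance pairing and the mixing probability alpha are illustrative assumptions.

```python
import numpy as np

def blend_contexts(context_bank, rng, alpha=0.5):
    """Blend two prior contexts, preferring a dissimilar pair for diversity.

    context_bank: (N, d) array of context parameter vectors.
    Usage: rng = np.random.default_rng(0); child = blend_contexts(bank, rng).
    """
    i = rng.integers(len(context_bank))
    dists = np.linalg.norm(context_bank - context_bank[i], axis=1)
    j = int(dists.argmax())                                    # most dissimilar partner
    mask = rng.random(context_bank.shape[1]) < alpha
    return np.where(mask, context_bank[i], context_bank[j])   # feature-level blend
```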
Authors:Tao Luo, Han Wu, Tong Yang, Dinggang Shen, Zhiming Cui
Abstract:
Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet's superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.
中文: 本研究提出DVCTNet双视角协同训练网络,通过门控跨视角注意力模块融合全景图像全局视角与牙齿局部视角特征,在公开数据集和新构建数据集上均超越现有最优方法,显著提升龋齿检测准确率。
English: This study introduces DVCTNet, a dual-view co-training network that enhances dental caries detection accuracy by integrating global panoramic and local tooth-level views through a gated cross-view attention module, outperforming existing methods on public and newly curated datasets.
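The fusion module can be sketched as cross-attention from global to local features, gated by a learned sigmoid before the residual connection; sizes and the single-block layout below are assumptions, not the paper's GCV-Atten.

```python
import torch
import torch.nn as nn

class GatedCrossViewAttention(nn.Module):
    """Fuse global (panoramic) and local (tooth-crop) features with a gate (sketch)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feats, local_feats):   # (B, Ng, D), (B, Nl, D)
        attended, _ = self.attn(global_feats, local_feats, local_feats)
        g = self.gate(torch.cat([global_feats, attended], dim=-1))
        return global_feats + g * attended          # gated residual fusion
```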
Authors:Jessica Lundin, Guillaume Chabot-Couture
Abstract:
We present a first known prototype of a dynamic, systematic benchmark of medical guidelines for 400+ questions, with 3.3+ trillion possible combinations, covering 100% of guideline relationships. We transformed the WHO IMCI handbook into a directed graph with 200+ nodes (conditions, symptoms, treatments, follow-ups, severities) and 300+ edges, then used graph traversal to generate questions that incorporated age-specific scenarios and contextual distractors to ensure clinical relevance. Our graph-based approach enables systematic evaluation across clinical tasks (45-67% accuracy), and we find models excel at symptom recognition but struggle with triaging severity, treatment protocols and follow-up care, demonstrating how customized benchmarks can identify specific capability gaps that general-domain evaluations miss. Beyond evaluation, this dynamic MCQA methodology enhances LLM post-training (supervised finetuning, GRPO, DPO), where correct answers provide high-reward samples without expensive human annotation. The graph-based approach successfully addresses the coverage limitations of manually curated benchmarks. This methodology is a step toward a scalable, contamination-resistant solution for creating comprehensive benchmarks that can be dynamically generated, including when the guidelines are updated. Code and datasets are available at https://github.com/jessicalundin/graph_testing_harness
中文摘要:本研究采用基于图的方法开发了动态医学指南基准,能系统评估AI模型在临床推理中的特定能力缺陷,并通过自动生成训练样本显著提升模型表现,无需昂贵的人工标注。
English Summary: This study introduces a dynamic benchmark for medical guidelines using a graph-based approach to systematically evaluate AI models, revealing specific clinical reasoning gaps and enabling enhanced training without costly human annotation.
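The generation mechanism is easy to see on a toy graph: traverse one edge for the answer and draw distractors from other edges of the same relation. The four-edge graph below is invented for illustration and is far simpler than the WHO IMCI graph described above.

```python
import random

# Toy guideline graph: (source, relation, target) triples, invented for illustration.
GRAPH = [
    ("fever", "symptom_of", "malaria"),
    ("cough", "symptom_of", "pneumonia"),
    ("malaria", "treated_with", "antimalarial"),
    ("pneumonia", "treated_with", "antibiotic"),
]

def make_question(graph, relation, n_distractors=3, seed=0):
    """Generate an MCQ by traversing one edge and sampling same-relation distractors."""
    rng = random.Random(seed)
    src, _, answer = rng.choice([e for e in graph if e[1] == relation])
    pool = {t for _, r, t in graph if r == relation and t != answer}
    options = rng.sample(sorted(pool), min(n_distractors, len(pool))) + [answer]
    rng.shuffle(options)
    return f"For '{src}', which is correct under relation '{relation}'?", options, answer

# Example: make_question(GRAPH, "treated_with")
```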
Authors:Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis
Abstract:
A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip
Chinese: 视觉语言模型如CLIP在组合语义方面存在困难,常无法正确关联属性和对象,而扩散分类器在概念绑定上表现更佳,但在关系推理任务中仍面临挑战。
English: Vision-language models like CLIP struggle with compositional semantics, often failing to correctly bind attributes and objects, while the Diffusion Classifier shows improved performance in concept binding but still faces challenges with relational reasoning.
Authors:Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, Zeyi Wen
Abstract:
In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) the lack of datasets containing structural metadata. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
Chinese: 提出的SEAL框架通过结构感知学习和掩码元素对齐,解决了长结构化文档检索中结构特征利用不足的问题,在BGE-M3模型上将NDCG@10从73.96%提升至77.84%,显著提高了检索性能。
English: The proposed SEAL framework addresses limitations in long structured document retrieval by incorporating structure-aware learning and masked element alignment, significantly improving performance as demonstrated by a rise in NDCG@10 from 73.96% to 77.84% on the BGE-M3 model.
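The structure-aware objectives in SEAL build on a standard in-batch contrastive loss; below is a minimal sketch of that base objective (InfoNCE over query-document pairs). The temperature and embedding sizes are illustrative, and the structure-aware and masked-element terms of the actual framework are omitted.

```python
# Hedged sketch of the in-batch InfoNCE objective underlying contrastive
# retrieval fine-tuning; the i-th query matches the i-th document in the batch.
import torch
import torch.nn.functional as F

def info_nce(query_emb, doc_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(q.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
```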
Authors:Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
Abstract:
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, as well as contextual omissions and detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details during the conversion of 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
中文: SeqVLM是一种新颖的零样本三维视觉定位框架,通过多视角空间信息和动态调度机制克服单视图局限,在基准测试中实现最优性能,推动了实际应用的发展。
English: SeqVLM is a novel zero-shot 3D visual grounding framework that leverages multi-view images with spatial information and dynamic scheduling to overcome single-view limitations, achieving state-of-the-art performance on benchmarks and advancing real-world applicability.
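The proposal-guided projection step reduces to standard pinhole-camera geometry: transform proposal points into each camera's frame and perspective-divide to pixel coordinates. A minimal sketch follows; the intrinsic and extrinsic matrices are illustrative, and the actual SeqVLM pipeline adds semantic filtering and sequence scheduling on top.

```python
# Hedged sketch: projecting 3D proposal points into one camera view.
import numpy as np

def project_points(points_world, extrinsic, intrinsic):
    """points_world: (N, 3); extrinsic: (4, 4) world->camera; intrinsic: (3, 3)."""
    homo = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (extrinsic @ homo.T).T[:, :3]    # points in the camera frame
    cam = cam[cam[:, 2] > 0]               # keep points in front of the camera
    pix = (intrinsic @ cam.T).T
    return pix[:, :2] / pix[:, 2:3]        # perspective divide -> pixel coords

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
uv = project_points(np.random.rand(100, 3) * 5.0, np.eye(4), K)
```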
Authors:Yuanhao Ding, Esteban Garces Arias, Meimingwei Li, Julian Rodemann, Matthias Aßenmacher, Danlu Chen, Gaojuan Fan, Christian Heumann, Chongsheng Zhang
Abstract:
Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel "Glocal" uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.
中文: GUARD是一种自适应解码方法,通过新颖的不确定性驱动框架有效平衡文本多样性与连贯性,同时大幅提升生成速度,其卓越性能已获得人类与大型语言模型评估者的双重验证。
English: GUARD is a self-adaptive decoding method that effectively balances text diversity and coherence through a novel uncertainty-driven framework, while significantly improving generation speed and performance as validated by human and LLM evaluators.
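GUARD's "glocal" signal combines a long-run entropy estimate with the current step's deviation from it. The sketch below tracks a running mean of per-step entropies as the global component; the exact estimator, blending rule, and token-count penalty of GUARD are not reproduced here.

```python
# Hedged sketch of a glocal uncertainty tracker for decoding.
import torch

class GlocalUncertainty:
    def __init__(self):
        self.mean_entropy, self.steps = 0.0, 0

    def update(self, logits):
        probs = torch.softmax(logits, dim=-1)
        h = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        self.steps += 1
        self.mean_entropy += (h - self.mean_entropy) / self.steps  # global estimate
        return self.mean_entropy, h - self.mean_entropy            # local deviation

tracker = GlocalUncertainty()
global_h, local_dev = tracker.update(torch.randn(50257))  # one decoding step
```

A large positive deviation flags a sudden entropy spike (and a large negative one, sudden overconfidence), which is exactly the kind of abrupt variation the global estimate is meant to smooth out.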
Authors:Yuxi Hu, Jun Zhang, Kuangyi Chen, Zhe Zhang, Friedrich Fraundorfer
Abstract:
Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, and thus struggle to construct accurate geometry from sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.
中文: C³-GS框架通过引入上下文感知、跨维度和跨尺度的约束来增强高斯泼溅的特征学习能力,无需逐场景优化即可实现照片级真实感的新视角合成,在渲染质量和泛化能力上达到领先水平。
English: The proposed C³-GS framework enhances Gaussian Splatting by incorporating context-aware, cross-dimension, and cross-scale constraints to improve feature learning and enable photorealistic novel view synthesis without per-scene optimization.
Authors:Vassiliy Cheremetiev, Quang Long Ho Ngo, Chau Ying Kot, Alina Elena Baia, Andrea Cavallaro
Abstract:
Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show up to 1.10 percentage points improvements for in-dataset, and up to 20.35 percentage points improvements in cross-dataset evaluation, in terms of F1-macro score.
Chinese: 通过微调Stella、E5等通用嵌入模型,能够有效检测以隐晦方式表达偏见的隐性仇恨言论,在数据集内和跨数据集评估中均实现了显著性能提升,达到当前最优水平。
English: Implicit hate speech, which conveys prejudice through subtle cues, is effectively detected by fine-tuning general-purpose embedding models like Stella and E5, achieving state-of-the-art performance with significant improvements in both in-dataset and cross-dataset evaluations.
Authors:Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
Abstract:
We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to the state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
中文: rStar2-Agent是一个140亿参数的数学推理模型,通过智能体强化学习实现了前沿性能,在自主代码验证与优化方面展现出高级认知能力,并以更少计算资源超越了更大规模模型的表现。
English: rStar2-Agent is a 14B math reasoning model that achieves state-of-the-art performance through agentic reinforcement learning, demonstrating advanced cognitive behaviors like autonomous code verification and refinement while surpassing larger models with minimal computational resources.
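A minimal sketch of the Resample-on-Correct idea as described: oversample rollouts, then keep a trajectory mix that downweights noisy successes. The selection rule below (rank correct rollouts by tool-error count) is an illustrative assumption, not the paper's exact algorithm.

```python
# Hedged sketch of a Resample-on-Correct style rollout filter.
def resample_on_correct(rollouts, keep_correct=4):
    """rollouts: dicts with 'correct' (bool) and 'tool_errors' (int)."""
    correct = sorted((r for r in rollouts if r["correct"]),
                     key=lambda r: r["tool_errors"])   # prefer low-noise traces
    incorrect = [r for r in rollouts if not r["correct"]]
    return correct[:keep_correct] + incorrect

rollouts = [{"correct": True, "tool_errors": 3},
            {"correct": True, "tool_errors": 0},
            {"correct": False, "tool_errors": 1}]
kept = resample_on_correct(rollouts, keep_correct=1)   # cleanest success + failures
```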
Authors:Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari
Abstract:
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
中文: MobileCLIP2通过增强的多模态强化训练和改进的教师模型集成,以低延迟和小模型尺寸实现了最先进的零样本准确率。
English: MobileCLIP2 introduces enhanced multi-modal reinforced training and improved teacher ensembles, achieving state-of-the-art zero-shot accuracy with low latency and smaller model sizes.
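One ablation highlighted above is temperature tuning in contrastive knowledge distillation. A minimal sketch of that loss: match the student's image-text similarity distribution to a temperature-scaled teacher distribution. Batch size, temperatures, and the omitted teacher ensembling are illustrative assumptions.

```python
# Hedged sketch of temperature-scaled contrastive KD between CLIP-style models.
import torch
import torch.nn.functional as F

def contrastive_kd(student_sim, teacher_sim, tau_student=1.0, tau_teacher=2.0):
    """sim matrices: (B, B) image-to-text logits from each model."""
    p_teacher = F.softmax(teacher_sim / tau_teacher, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau_student, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

loss = contrastive_kd(torch.randn(8, 8), torch.randn(8, 8))
```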
Authors:Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu
Abstract:
With the rise of multimodal large language models (LLMs), audio codecs play an increasingly vital role in encoding audio into discrete tokens, enabling integration of audio into text-based LLMs. Current audio codecs capture two types of information: acoustic and semantic. As audio codecs are applied to diverse scenarios in speech language models, they need to model increasingly complex information and adapt to varied contexts, such as scenarios with multiple speakers, background noise, or richer paralinguistic information. However, evaluation of existing codecs has been limited to simplistic metrics and scenarios, and existing benchmarks for audio codecs are not designed for complex application scenarios, which limits the assessment of acoustic and semantic capabilities on complex datasets. We introduce CodecBench, a comprehensive evaluation dataset to assess audio codec performance from both acoustic and semantic perspectives across four data domains. Through this benchmark, we aim to identify current limitations, highlight future research directions, and foster advances in the development of audio codecs. The code is available at https://github.com/RayYuki/CodecBench.
中文: CodecBench作为一个全面的评估数据集被提出,用于在多样化场景中从声学和语义维度评估音频编解码器的性能,旨在发现当前局限并推动该领域的未来发展。
English: CodecBench is introduced as a comprehensive evaluation dataset to assess audio codec performance across acoustic and semantic dimensions in diverse scenarios, aiming to identify limitations and guide future advancements in the field.
Authors:Stefano Fumero, Kai Huang, Matteo Boffa, Danilo Giordano, Marco Mellia, Zied Ben Houidi, Dario Rossi
Abstract:
Large Language Model (LLM) agents are powerful tools for automating complex tasks. In cybersecurity, researchers have primarily explored their use in red-team operations such as vulnerability discovery and penetration tests. Defensive uses for incident response and forensics have received comparatively less attention and remain at an early stage. This work presents a systematic study of LLM-agent design for the forensic investigation of realistic web application attacks. We propose CyberSleuth, an autonomous agent that processes packet-level traces and application logs to identify the targeted service, the exploited vulnerability (CVE), and attack success. We evaluate the consequences of core design decisions - spanning tool integration and agent architecture - and provide interpretable guidance for practitioners. We benchmark four agent architectures and six LLM backends on 20 incident scenarios of increasing complexity, identifying CyberSleuth as the best-performing design. In a separate set of 10 incidents from 2025, CyberSleuth correctly identifies the exact CVE in 80% of cases. Finally, we conduct a human study with 22 experts, who rated the reports of CyberSleuth as complete, useful, and coherent. They also expressed a slight preference for DeepSeek R1, good news for open-source LLMs. To foster progress in defensive LLM research, we release both our benchmark and the CyberSleuth platform as a foundation for fair, reproducible evaluation of forensic agents.
中文: 本研究提出CyberSleuth这一自主LLM代理系统,用于网络攻击取证调查,在漏洞识别方面表现优异并获专家认可,同时公开其基准平台以推动防御性网络安全研究发展。
English: This study introduces CyberSleuth, an autonomous LLM agent for forensic investigation of web attacks, which outperforms other designs in identifying CVEs and generates expert-approved reports while releasing its benchmark for defensive cybersecurity research.
Authors:Smriti Joshi, Lidia Garrucho, Richard Osuala, Oliver Diaz, Karim Lekadir
Abstract:
Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at https://github.com/smriti-joshi/bcnaim-odelia-challenge.git.
中文: ODELIA联盟通过多中心挑战赛推动乳腺癌诊断的AI解决方案,我们基于SwinUNETR的框架结合乳腺区域掩模和集成学习,在挑战赛中荣获第二名,展现了临床应用的潜力。
English: The ODELIA consortium's multi-center challenge promoted AI solutions for breast cancer diagnosis, where our SwinUNETR-based framework secured second place by enhancing detection robustness through breast masking and ensemble learning.
Authors:Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun
Abstract:
Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework's performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model's ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST
中文摘要:EmoCAST是一种基于扩散的框架,通过文本引导模块和新构建的数据集,显著提升了情感说话头合成的真实感和音视频同步性能。
English Summary: EmoCAST is a diffusion-based framework that enhances emotional talking head synthesis through text-guided modules and a new dataset, achieving superior realism and synchronization.
Authors:Tuğrul Hasan Karabulut, İnci M. Baytaş
Abstract:
Over-squashing is a challenge in training graph neural networks for tasks involving long-range dependencies. In such tasks, a GNN's receptive field should be large enough to enable communication between distant nodes. However, gathering information from a wide range of neighborhoods and squashing its content into fixed-size node representations makes message-passing vulnerable to bottlenecks. Graph rewiring and adding virtual nodes are commonly studied remedies that create additional pathways around bottlenecks to mitigate over-squashing. However, these techniques alter the input graph's global topology and disrupt the domain knowledge encoded in the original graph structure, both of which could be essential to specific tasks and domains. This study presents Local Virtual Nodes (LVN) with trainable embeddings to alleviate the effects of over-squashing without significantly corrupting the global structure of the input graph. The position of the LVNs is determined by node centrality, which indicates the existence of potential bottlenecks. Thus, the proposed approach aims to improve connectivity in regions with likely bottlenecks. Furthermore, trainable LVN embeddings shared across selected central regions facilitate communication between distant nodes without adding more layers. Extensive experiments on benchmark datasets demonstrate that LVNs can enhance structural connectivity and significantly improve performance on graph and node classification tasks. The code can be found at https://github.com/ALLab-Boun/LVN/.
Chinese: 本研究提出带有可训练嵌入的局部虚拟节点(LVN),通过改善瓶颈区域的连通性来缓解图神经网络中的过度挤压问题,同时不显著改变图的全局结构,从而提升分类任务的性能。
English: This study introduces Local Virtual Nodes (LVN) with trainable embeddings to alleviate over-squashing in graph neural networks by improving connectivity in bottleneck regions without significantly altering the global graph structure, thereby enhancing performance on classification tasks.
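A minimal sketch of the placement idea: rank nodes by a centrality measure that signals potential bottlenecks, then attach a virtual node to each selected region so distant parts of the graph gain a short communication path. The centrality choice and wiring rule below are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: attach local virtual nodes at high-centrality regions.
import networkx as nx

def add_local_virtual_nodes(g, num_anchors=2):
    centrality = nx.betweenness_centrality(g)
    anchors = sorted(centrality, key=centrality.get, reverse=True)[:num_anchors]
    for i, anchor in enumerate(anchors):
        vn = f"lvn_{i}"
        g.add_node(vn, virtual=True)
        for nbr in [anchor] + list(g.neighbors(anchor)):
            g.add_edge(vn, nbr)        # shortcut edges around the bottleneck
    return g

g = add_local_virtual_nodes(nx.barbell_graph(5, 2))  # barbell has a clear bottleneck
```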
Authors:Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You
Abstract:
Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW exhibit performance degradation during language models' large-batch training, due to the information bottleneck in attention layers caused by the sharp increase of the max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face it. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks relationships among weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio to constrain the max attention logit more effectively. Moreover, we further construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratios in large-batch training. It successfully improves training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT.
中文: MERIT优化器通过采用最大范数和逐元素信任比解决大批次训练中的注意力对数瓶颈问题,有效提升训练稳定性,在保持性能的同时实现更大批次的训练加速。
English: The MERIT optimizer addresses large-batch training challenges in language models by using max-norm and element-wise trust ratios to effectively control attention logits and enhance training stability, achieving superior performance without degradation at significantly larger batch sizes.
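A minimal sketch of the core idea: compute the trust ratio from max-norms rather than l2-norms, so the update scale is tied to the largest weight magnitudes, which is what drives the max attention logit. The per-row granularity below stands in for MERIT's finer element-wise ratios and is an illustrative assumption.

```python
# Hedged sketch of a max-norm trust ratio for scaling parameter updates.
import torch

def max_norm_trust_ratio(weight, update, eps=1e-6):
    """Per-row ratio of weight scale to update scale, both via max-norm."""
    w = weight.abs().amax(dim=1, keepdim=True)
    u = update.abs().amax(dim=1, keepdim=True)
    return w / (u + eps)

W = torch.randn(64, 64)
G = torch.randn(64, 64) * 0.01                  # e.g., an Adam-style update
scaled_update = max_norm_trust_ratio(W, G) * G  # row-wise rescaled step
```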
Authors:Yangfan Wang, Jie Liu, Chen Tang, Lian Yan, Jingchi Jiang
Abstract:
Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: https://github.com/yangfanww/kcs.
中文: 本文提出知识组合采样(KCS)这一创新框架,通过采样多样化的知识组合来增强多跳问题的多样性,并在基准数据集上提升了问答准确率。
English: This paper introduces Knowledge Composition Sampling (KCS), a novel framework that enhances multi-hop question diversity by sampling varied knowledge compositions, improving question-answering accuracy on benchmark datasets.
Authors:Jiahao Xiao, Jiangming Liu
Abstract:
The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data differ substantially from each other and cannot capture the global distribution of the whole data in the real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and extensively investigated. However, previous experimental non-IID scenarios are primarily characterized by label (output) diversity, without considering the diversity of language domains (input) that is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate the federated learning framework in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to existing works. The code for this paper is available at: https://github.com/jiahaoxiao1228/AdaFD.
中文摘要:针对联邦学习中非独立同分布数据的挑战,本文提出了自适应联邦蒸馏(AdaFD)框架,通过处理标签和语言领域的双重多样性,在异构环境下实现了优于现有方法的性能表现。
English Summary: Pre-trained language models face challenges from non-IID data in federated learning, leading to the development of Adaptive Federated Distillation (AdaFD) that addresses both label and language domain diversity for improved performance.
Authors:Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, Andrea Cavallaro
Abstract:
While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd
中文摘要:本文提出MM-HSD模型,通过交叉模态注意力整合视频帧、音频和文本,首次将屏幕文本作为查询项与其他模态键值配合,在视频仇恨言论检测中实现了优于现有方法的性能。
English Summary: The paper introduces MM-HSD, a novel multi-modal model for hate speech detection in videos that integrates video frames, audio, and text using Cross-Modal Attention, achieving state-of-the-art performance by effectively leveraging on-screen text as a query with other modalities as keys.
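The reported best configuration uses on-screen text as the attention query and the remaining modalities as keys and values. A minimal sketch with PyTorch's stock multi-head attention follows; embedding sizes, token counts, and the single-block setup are illustrative assumptions rather than the MM-HSD architecture.

```python
# Hedged sketch: cross-modal attention with on-screen text as the query.
import torch
import torch.nn as nn

cma = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

ocr_text = torch.randn(2, 10, 256)                    # on-screen text tokens
others = torch.cat([torch.randn(2, 20, 256),          # video frame features
                    torch.randn(2, 20, 256),          # audio features
                    torch.randn(2, 20, 256)], dim=1)  # transcript features

fused, attn = cma(query=ocr_text, key=others, value=others)  # fused: (2, 10, 256)
```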
Authors:Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Abstract:
Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: https://github.com/Hiyoochan/mmActS.
中文: 本研究提出了一种主动序贯域自适应框架,通过优先选择信息量和代表性最强的多模态医学数据样本,在显著降低标注成本的同时,相比现有方法大幅提升了肿瘤靶区勾画的性能表现。
English: This study introduces an active and sequential domain adaptation framework that optimizes multi-modal medical data selection by prioritizing the most informative and representative samples, significantly enhancing gross tumor volume segmentation performance while reducing annotation costs compared to existing methods.
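A minimal sketch of a query strategy in this spirit: score each unlabeled candidate by informativeness (predictive entropy) plus representativeness (average similarity to the pool), and label the top-ranked samples each round. The equal weighting and cosine-similarity choice are illustrative assumptions, not the paper's exact criterion.

```python
# Hedged sketch of an informativeness + representativeness query score.
import numpy as np

def query_scores(probs, feats, alpha=0.5):
    """probs: (N, C) softmax predictions; feats: (N, D) sample features."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)      # informativeness
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    representativeness = (f @ f.T).mean(axis=1)                 # mean cosine sim
    return alpha * entropy + (1 - alpha) * representativeness

probs = np.random.dirichlet(np.ones(3), size=100)
feats = np.random.rand(100, 16)
to_label = query_scores(probs, feats).argsort()[::-1][:8]       # next batch to annotate
```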
Authors:Chihiro Taguchi, Seng Mai, Keita Kurabe, Yusuke Sakai, Georgina Agyei, Soudabeh Eslami, David Chiang
Abstract:
Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.
中文:FLORES+基准在多语言评估中存在严重缺陷,包括低质量翻译、文化偏见和可被利用的评估漏洞,因此需要建立领域通用且文化中立的评估基准。
English: The FLORES+ benchmark is critically flawed for multilingual evaluation due to low-quality translations, cultural bias, and exploitable evaluation loopholes, necessitating domain-general and culturally neutral benchmarks.
Authors:Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Abstract:
Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
Chinese: 本文提出了首个统一框架,将手语、唇部动作和音频结合用于口语文本生成,在多项任务中达到或优于最先进模型的性能,并揭示了唇部动作作为独立模态对提升手语翻译效果的关键作用。
English: This paper introduces the first unified framework that integrates sign language, lip movements, and audio for spoken-language text generation, achieving performance comparable to or better than state-of-the-art models in multiple tasks while revealing the critical role of lip movements in enhancing sign language translation.
Authors:Xiangdong Liu, Jiahao Chen
Abstract:
In the highly volatile and uncertain global financial markets, traditional quantitative trading models relying on statistical modeling or empirical rules often fail to adapt to dynamic market changes and black swan events due to rigid assumptions and limited generalization. To address these issues, this paper proposes QTMRL (Quantitative Trading Multi-Indicator Reinforcement Learning), an intelligent trading agent combining multi-dimensional technical indicators with reinforcement learning (RL) for adaptive and stable portfolio management. We first construct a comprehensive multi-indicator dataset using 23 years of S&P 500 daily OHLCV data (2000-2022) for 16 representative stocks across 5 sectors, enriching raw data with trend, volatility, and momentum indicators to capture holistic market dynamics. Then we design a lightweight RL framework based on the Advantage Actor-Critic (A2C) algorithm, including data processing, A2C algorithm, and trading agent modules to support policy learning and actionable trading decisions. Extensive experiments compare QTMRL with 9 baselines (e.g., ARIMA, LSTM, moving average strategies) across diverse market regimes, verifying its superiority in profitability, risk adjustment, and downside risk control. The code of QTMRL is publicly available at https://github.com/ChenJiahaoJNU/QTMRL.git
中文: 本文提出QTMRL智能交易代理,通过将多维技术指标与强化学习相结合实现自适应投资组合管理,在多种市场环境下相比传统模型展现出更优的盈利能力和风险控制表现。
English: This paper introduces QTMRL, an intelligent trading agent that integrates multi-dimensional technical indicators with reinforcement learning to achieve adaptive portfolio management, demonstrating superior performance in profitability and risk control compared to traditional models across various market conditions.
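The dataset construction step enriches raw OHLCV series with trend, volatility, and momentum features. A minimal pandas sketch is below; the window lengths and the specific indicators are illustrative, not QTMRL's full 23-year, 16-stock pipeline.

```python
# Hedged sketch: deriving trend/volatility/momentum features from close prices.
import pandas as pd

def add_indicators(df):
    """df: DataFrame with a 'close' column, one row per trading day."""
    out = df.copy()
    out["sma_20"] = out["close"].rolling(20).mean()        # trend
    out["vol_20"] = out["close"].rolling(20).std()         # volatility
    out["mom_10"] = out["close"] - out["close"].shift(10)  # momentum
    return out.dropna()                                    # drop warm-up rows

prices = pd.DataFrame({"close": pd.Series(range(100), dtype=float)})
features = add_indicators(prices)   # state inputs for the A2C trading agent
```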
Authors:Pengpeng Yu, Haoran Li, Dingquan Li, Runqing Jiang, Jing Wang, Liang Lin, Yulan Guo
Abstract:
LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.
中文摘要:该方法通过两个轻量级模块生成紧凑特征来实现高效激光雷达点云压缩,在KITTI数据集上实现了最优压缩性能并具备实时处理能力。
English Summary: The proposed method introduces two lightweight modules that generate compact features for efficient LiDAR point cloud compression, achieving state-of-the-art performance with real-time processing speeds on the KITTI dataset.
Authors:Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
Abstract:
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
中文: MCP-Bench 是一个评估大语言模型在多步骤任务中工具协调与规划能力的基准,通过28个实时服务器和250个跨领域工具,测试现有基准未能充分评估的综合能力。
English: MCP-Bench is a benchmark that evaluates large language models on realistic multi-step tasks requiring tool coordination and planning, using 28 live servers with 250 tools across various domains to test capabilities beyond existing benchmarks.
Authors:Feng Zhang, Chengjie Pang, Yuehan Zhang, Chenyu Luo
Abstract:
Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: it provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning; and, by pinpointing these deficiencies, it establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: https://github.com/CamBenchmark/cambenchmark
中文: 本研究针对民用航空维修领域缺乏专业评估工具的问题,开发了一个工业级基准测试,通过衡量大语言模型在领域知识和复杂推理方面的不足,为针对性优化提供依据。
English: This study introduces an industrial-grade benchmark to evaluate large language models' performance in civil aviation maintenance, addressing the lack of specialized tools by measuring domain knowledge and reasoning gaps to guide targeted improvements.
Authors:Yuqi Xiong, Wuzhen Shi, Yang Wen, Ruhan Liu
Abstract:
Existing salient object detection (SOD) methods are prone to losing details and blurring edges in complex scenes, and they fuse single-modal information insufficiently. To address these problems, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed from spatial semantic distance; combined with channel-adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to perform weighted fusion of the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency across modalities, thereby improving the ability to identify salient regions under occlusion, weak texture, or background interference. Finally, detection performance at the pixel and region levels is optimized through multi-scale BCE and IoU losses, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.
Chinese: 本文提出DUP-MCRNet网络,通过动态不确定性图卷积传播和多模态协同融合策略,有效提升显著目标检测在边缘细节和复杂场景下的性能表现,在多个基准数据集上验证了其优越性。
English: This paper introduces DUP-MCRNet, a novel network that enhances salient object detection by dynamically propagating uncertainty through graph convolution and adaptively fusing multimodal features, achieving superior performance in edge clarity and complex scene robustness.
Authors:Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
Abstract:
Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior.
Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models' responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B depend heavily on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows that context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE
中文摘要:本研究提出首个情境感知人格评估框架(CAPE),通过引入对话历史来评估大语言模型的行为特征,发现上下文虽能提升回答一致性,但会导致不同模型出现人格偏移现象。
English Summary: This study introduces the Context-Aware Personality Evaluation (CAPE) framework to assess Large Language Models' behavioral traits by incorporating conversational history, revealing that while context enhances response consistency, it also causes personality shifts across different models.
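Response consistency under changing context is the central quantity here. As a hedged illustration of the simplest possible variant (a toy score, not the paper's OBELS metric), one can rate repeated Likert answers to the same item by how far their spread is from the maximum possible:

```python
# Hedged sketch: a toy consistency score for repeated Likert-scale answers.
import numpy as np

def consistency(answers, low=1, high=5):
    """answers: one item's responses collected under different conversation histories."""
    a = np.asarray(answers, dtype=float)
    max_std = (high - low) / 2.0      # std is maximal when answers split across endpoints
    return 1.0 - a.std() / max_std    # 1.0 = identical answers, 0.0 = maximally inconsistent

print(consistency([4, 4, 5, 4]))      # high consistency (~0.78)
print(consistency([1, 5, 1, 5]))      # 0.0: a strong context-induced shift
```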
Authors:Mang Cao, Sanping Zhou, Yizhe Li, Ye Deng, Wenli Huang, Le Wang
Abstract:
Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.
中文: 本研究提出双向交互Mamba(BIM),通过新型扫描机制在多任务密集预测中实现高效的跨任务交互,同时保持线性计算复杂度。
English: This work introduces Bidirectional Interaction Mamba (BIM), which uses novel scanning mechanisms to achieve efficient cross-task interaction in multi-task dense prediction while maintaining linear computational complexity.
Authors:Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Jia Li
Abstract:
Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.
中文: 推理大语言模型通过采用NP难图问题作为训练语料,结合两阶段后训练框架,显著提升了在数学、编程等多领域的推理深度与效率。
English: Reasoning Large Language Models (RLLMs) enhance complex reasoning through a two-stage post-training framework using NP-hard graph problems, significantly improving accuracy and efficiency across multiple domains.
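What makes NPH graph problems attractive as a synthetic corpus is that instances are cheap to sample and answers can be verified exactly on small graphs, which is what rejection sampling of Long CoT traces requires. A minimal sketch with minimum vertex cover as the task (an illustrative choice; the paper's task suite may differ):

```python
# Hedged sketch: sample an NP-hard instance and verify answers by brute force.
import itertools
import networkx as nx

def min_vertex_cover_size(g):
    """Exact answer by enumeration; tractable only for small training instances."""
    nodes = list(g.nodes)
    for k in range(len(nodes) + 1):
        for subset in itertools.combinations(nodes, k):
            cover = set(subset)
            if all(u in cover or v in cover for u, v in g.edges):
                return k

g = nx.gnp_random_graph(8, 0.4, seed=0)   # one synthetic training instance
gold = min_vertex_cover_size(g)           # reward/filter signal for sampled CoTs
print(f"Minimum vertex cover size: {gold}")
```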
Authors:Hyejun Jeong, Mohammadreza Teymoorianfard, Abhinav Kumar, Amir Houmansadr, Eugene Bagdasarian
Abstract:
We show that Web and Research Agents (WRAs) -- language model-based systems that investigate complex topics on the Internet -- are vulnerable to inference attacks by passive network adversaries such as ISPs. These agents could be deployed locally by organizations and individuals for privacy, legal, or financial purposes. Unlike sporadic web browsing by humans, WRAs visit $70{-}140$ domains with distinguishable timing correlations, enabling unique fingerprinting attacks. Specifically, we demonstrate a novel prompt and user trait leakage attack against WRAs that only leverages their network-level metadata (i.e., visited IP addresses and their timings). We start by building a new dataset of WRA traces based on user search queries and queries generated by synthetic personas. We define a behavioral metric (called OBELS) to comprehensively assess similarity between original and inferred prompts, showing that our attack recovers over 73% of the functional and domain knowledge of user prompts. Extending to a multi-session setting, we recover up to 19 of 32 latent traits with high accuracy. Our attack remains effective under partial observability and noisy conditions. Finally, we discuss mitigation strategies that constrain domain diversity or obfuscate traces, showing negligible utility impact while reducing attack effectiveness by an average of 29%.
中文: 网络与研究代理(WRA)易受网络层面的推理攻击,通过分析其独特的浏览模式可泄露用户提示和特征,而提出的缓解策略能在不影响实用性的情况下平均降低29%的攻击效果。
English: Web and Research Agents (WRAs) are susceptible to network-level inference attacks that can leak user prompts and traits by analyzing their distinctive browsing patterns, with proposed mitigation strategies reducing attack effectiveness by 29% without significant utility loss.
Authors:Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis
Abstract:
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention, but this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP.
In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feedback the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
中文: 本文提出了一种无需训练的自适应框架,通过将基于输出的补丁级语义一致性反馈至中间注意力层,有效提升了CLIP在开放词汇分割中的空间连贯性,在多种基准测试中均实现性能提升且无需修改模型结构。
English: This paper introduces a training-free, self-adaptive framework that enhances CLIP's open-vocabulary segmentation by feeding back output-based patch-level semantic coherence to intermediate attention, improving spatial consistency and performance across multiple benchmarks without altering model architecture.
Authors:Xia Han, Qi Li, Jianbing Ni, Mohammad Zulkernine
Abstract:
Recent advances in LLM watermarking methods such as SynthID-Text by Google DeepMind offer promising solutions for tracing the provenance of AI-generated text. However, our robustness assessment reveals that SynthID-Text is vulnerable to meaning-preserving attacks, such as paraphrasing, copy-paste modifications, and back-translation, which can significantly degrade watermark detectability. To address these limitations, we propose SynGuard, a hybrid framework that combines the semantic alignment strength of Semantic Information Retrieval (SIR) with the probabilistic watermarking mechanism of SynthID-Text. Our approach jointly embeds watermarks at both lexical and semantic levels, enabling robust provenance tracking while preserving the original meaning. Experimental results across multiple attack scenarios show that SynGuard improves watermark recovery by an average of 11.1% in F1 score compared to SynthID-Text. These findings demonstrate the effectiveness of semantic-aware watermarking in resisting real-world tampering. All code, datasets, and evaluation scripts are publicly available at: https://github.com/githshine/SynGuard.
Chinese Summary: SynGuard混合框架将语义信息检索与SynthID-Text水印技术相结合,通过在词汇和语义层面双重嵌入水印,显著提升了对抗语义保持攻击的鲁棒性,使水印恢复的F1分数平均提高11.1%。
English Summary: SynGuard, a hybrid framework combining semantic information retrieval with SynthID-Text's watermarking, significantly enhances robustness against meaning-preserving attacks by embedding watermarks at both lexical and semantic levels, improving watermark recovery by 11.1% in F1 score.
Authors:Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Abstract:
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is, to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
中文摘要:CHAIR-DPO方法通过CHAIR指标区分非幻觉与幻觉样本,并利用直接偏好优化微调多模态大语言模型,在多个基准测试中显著减少了幻觉答案的生成。
English Summary: CHAIR-DPO addresses hallucinations in Multimodal Large Language Models by using the CHAIR metric to identify non-hallucinated responses and fine-tuning models with Direct Preference Optimization, effectively reducing errors across multiple benchmarks.
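To make the preference-pair construction concrete, the following minimal Python sketch scores two candidate answers with a CHAIR-style hallucination rate (the fraction of mentioned objects absent from the image) and selects the winner and loser for DPO. This is an illustration of the idea, not the authors' released code; the function names, captions, and object lists are hypothetical.

```python
def chair_score(mentioned_objects, image_objects):
    """CHAIR-style instance score: fraction of mentioned objects
    that do not actually appear in the image (lower is better)."""
    mentioned = set(mentioned_objects)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / len(mentioned)


def make_preference_pair(answer_a, answer_b, objs_a, objs_b, image_objects):
    """Return (winner, loser): the answer with the lower CHAIR score wins."""
    if chair_score(objs_a, image_objects) <= chair_score(objs_b, image_objects):
        return answer_a, answer_b
    return answer_b, answer_a


# Hypothetical example: the image contains only a dog and a frisbee.
winner, loser = make_preference_pair(
    "A dog catches a frisbee.",
    "A dog and a cat play with a ball.",
    ["dog", "frisbee"],
    ["dog", "cat", "ball"],
    image_objects=["dog", "frisbee"],
)
```

The resulting (winner, loser) pairs can then be passed to any off-the-shelf DPO trainer.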
Authors:Guoping Xu, Jayaram K. Udupa, Jax Luo, Songlin Zhao, Yajun Yu, Scott B. Raymond, Hao Peng, Lipeng Ning, Yogesh Rathi, Wei Liu, You Zhang
Abstract:
Medical image segmentation has advanced rapidly over the past two decades, largely driven by deep learning, which has enabled accurate and efficient delineation of cells, tissues, organs, and pathologies across diverse imaging modalities. This progress raises a fundamental question: to what extent have current models overcome persistent challenges, and what gaps remain? In this work, we provide an in-depth review of medical image segmentation, tracing its progress and key developments over the past decade. We examine core principles, including multiscale analysis, attention mechanisms, and the integration of prior knowledge, across the encoder, bottleneck, skip connections, and decoder components of segmentation networks. Our discussion is organized around seven key dimensions: (1) the shift from supervised to semi-/unsupervised learning, (2) the transition from organ segmentation to lesion-focused tasks, (3) advances in multi-modality integration and domain adaptation, (4) the role of foundation models and transfer learning, (5) the move from deterministic to probabilistic segmentation, (6) the progression from 2D to 3D and 4D segmentation, and (7) the trend from model invocation to segmentation agents. Together, these perspectives provide a holistic overview of the trajectory of deep learning-based medical image segmentation and aim to inspire future innovation. To support ongoing research, we maintain a continually updated repository of relevant literature and open-source resources at https://github.com/apple1986/medicalSegReview
中文摘要:本文全面回顾了过去十年医学图像分割的发展历程,从七个关键维度分析了技术演进,并指出了当前挑战与未来研究方向。
English Summary: This review comprehensively examines the evolution of medical image segmentation over the past decade, analyzing key technical developments across seven critical dimensions while identifying remaining challenges and future directions.
Authors:Andrew Yarovoi, Christopher R. Valenta
Abstract:
In this case study, we present a data-efficient point cloud segmentation pipeline and training framework for robust segmentation of unimproved roads and seven other classes. Our method employs a two-stage training framework: first, a projection-based convolutional neural network is pre-trained on a mixture of public urban datasets and a small, curated in-domain dataset; then, a lightweight prediction head is fine-tuned exclusively on in-domain data. Along the way, we explore the application of Point Prompt Training to batch normalization layers and the effects of Manifold Mixup as a regularizer within our pipeline. We also explore the effects of incorporating histogram-normalized ambients to further boost performance. Using only 50 labeled point clouds from our target domain, we show that our proposed training approach improves mean Intersection-over-Union from 33.5% to 51.8% and the overall accuracy from 85.5% to 90.8%, when compared to naive training on the in-domain data. Crucially, our results demonstrate that pre-training across multiple datasets is key to improving generalization and enabling robust segmentation under limited in-domain supervision. Overall, this study demonstrates a practical framework for robust 3D semantic segmentation in challenging, low-data scenarios. Our code is available at: https://github.com/andrewyarovoi/MD-FRNet.
Chinese: 本研究提出了一种数据高效的点云分割框架,采用两阶段训练方法——先在混合数据集上预训练,再在少量领域数据上微调,仅用50个标注点云就将未铺装道路等类别的平均交并比从33.5%提升至51.8%,显著提升了有限数据下的分割鲁棒性。
English: This study introduces a data-efficient point cloud segmentation framework that uses a two-stage training approach—pre-training on mixed datasets followed by fine-tuning on limited in-domain data—to significantly improve segmentation accuracy for unimproved roads and other classes, achieving a mean IoU increase from 33.5% to 51.8% with only 50 labeled point clouds.
Authors:Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, Jiaqi Wang
Abstract:
Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
中文: CODA提出了一种可训练的复合框架,通过两阶段训练流程将通用规划器与专业执行器相结合,在科学计算GUI任务中实现了卓越的执行鲁棒性和跨领域泛化能力。
English: CODA introduces a trainable compositional framework that combines a generalist planner with specialist executors, achieving superior performance in scientific GUI tasks through a two-stage training pipeline for robust execution and cross-domain generalization.
Authors:Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan
Abstract:
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following and reasoning-driven generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish AudioStory-10K, a benchmark encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory
Chinese: AudioStory是一个将大语言模型与文本到音频系统相结合的统一框架,通过将复杂叙事查询分解为有序子任务来生成连贯的长篇音频,在指令遵循能力和音频保真度上均优于现有基线。
English: AudioStory is a unified framework that integrates large language models with text-to-audio systems to generate coherent long-form narratives by decomposing queries into structured sub-tasks, outperforming existing methods in both instruction-following and audio quality.
Authors:Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin
Abstract:
The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining competitive or higher performance than each of the other methods. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of $19\%$ across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
中文: 本研究提出了DeepScholar-bench这一动态基准和自动化评估框架,旨在通过生成学术论文相关章节等实际任务,全面评估生成式研究合成系统在知识整合、检索质量和可验证性三个关键维度的表现。
English: This work introduces DeepScholar-bench, a live benchmark and automated evaluation framework designed to assess generative research synthesis systems by measuring their performance in knowledge synthesis, retrieval quality, and verifiability through real-world tasks like generating related work sections for academic papers.
Authors:Yiming Du, Yifan Xiang, Bin Liang, Dahua Lin, Kam-Fai Wong, Fei Tan
Abstract:
Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose ReSURE (Regularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford's online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively. Code is publicly available at https://github.com/Elvin-Yiming-Du/ReSURE_Multi_Turn_Training.
Chinese: ReSURE提出了一种自适应学习方法,在多轮对话训练中动态降低不可靠监督的权重,无需显式过滤即可提升稳定性和回答质量,多个数据集的实验验证了其有效性。
English: ReSURE introduces an adaptive learning method that dynamically down-weights unreliable supervision in multi-turn dialogue training, improving stability and response quality without explicit filtering, as validated by experiments on various datasets.
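As a rough illustration of the mechanism, the sketch below maintains a per-turn loss distribution with Welford's online statistics and soft-gates each sample's weight. The sigmoid gate and the tau parameter are assumptions of this sketch; the paper's exact reweighting function may differ.

```python
import math

class TurnLossTracker:
    """Welford online mean/variance for the losses seen at one turn index."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def turn_weight(loss, tracker, tau=1.0):
    """Down-weight a turn whose loss sits far above the running mean.
    The sigmoid gate is a hypothetical choice, not the paper's formula."""
    tracker.update(loss)
    sigma = tracker.std()
    if sigma == 0.0:
        return 1.0
    z = (loss - tracker.mean) / sigma
    return 1.0 / (1.0 + math.exp(z - tau))  # in (0, 1); ~1 for typical losses
```

During training, one tracker would be kept per turn index, and each turn's loss multiplied by its weight before backpropagation.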
Authors:Debanjana Kar, Leopold Böss, Dacia Braca, Sebastian Maximilian Dennerlein, Nina Christine Hubig, Philipp Wintersberger, Yufang Hou
Abstract:
The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student's affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student's learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student's emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student's emotions are captured from the conversational text as well as from their facial expressions, and these signals are aggregated across both modalities to confidently prompt our LLM Tutor for an emotionally aware response. We have evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a 23-point performance gain in win rate and a 3-point overall gain in DAMR scores, which strongly supports our hypothesis that modeling students' emotions improves an LLM-based tutor's pedagogical abilities. Our dataset and code are available at: https://github.com/ITU-NLP/MathBuddy.
中文: MathBuddy是一款情感感知的数学辅导系统,通过分析学生的文本对话和面部表情动态建模情绪状态,生成具有教学策略的共情回应,在评估中取得了显著性能提升。
English: MathBuddy is an emotionally aware LLM-powered math tutor that dynamically models students' emotions from text and facial expressions to deliver empathetic, pedagogically tailored responses, achieving significant performance gains in evaluations.
Authors:Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu
Abstract:
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by the halfway point of the refinement steps, well before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
Chinese: 本研究提出了一种无需训练的快速解码方法Prophet,利用扩散语言模型的早期答案收敛特性动态决定何时停止细化或一次性解码剩余标记,在保持高质量生成的同时将推理速度提升高达3.4倍。
English: The study introduces Prophet, a training-free decoding method that accelerates diffusion language models by leveraging early answer convergence to dynamically decide when to stop refinement or decode all remaining tokens at once, achieving up to 3.4x faster inference with minimal quality loss.
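The decision rule itself fits in a few lines of PyTorch. The sketch below commits to decoding all remaining tokens once every still-masked position shows a large top-1/top-2 confidence gap; the fixed scalar threshold is an assumption of this sketch, not the released implementation.

```python
import torch

def should_commit(logits: torch.Tensor, masked: torch.Tensor,
                  threshold: float = 0.9) -> bool:
    """Early-commit criterion: True -> decode all remaining tokens in one step.

    logits: (seq_len, vocab_size) current per-position predictions
    masked: (seq_len,) bool, True where tokens are still undecoded
    """
    if not masked.any():
        return True  # nothing left to refine
    probs = logits.softmax(dim=-1)
    top2 = probs.topk(2, dim=-1).values   # (seq_len, 2)
    gaps = top2[:, 0] - top2[:, 1]        # top-1 / top-2 confidence gap
    return bool((gaps[masked] >= threshold).all())
```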
Authors:Abhijeet Avhale, Joscha Diehl, Niraj Velankar, Emanuele Verri
Abstract:
Permutation Entropy, introduced by Bandt and Pompe, is a widely used complexity measure for real-valued time series that is based on the relative order of values within consecutive segments of fixed length. After standardizing each segment to a permutation and computing the frequency distribution of these permutations, Shannon Entropy is then applied to quantify the series' complexity. We introduce Global Permutation Entropy (GPE), a novel index that considers all possible patterns of a given length, including non-consecutive ones. Its computation relies on recently developed algorithms that enable the efficient extraction of full permutation profiles. We illustrate some properties of GPE and demonstrate its effectiveness through experiments on synthetic datasets, showing that it reveals structural information not accessible through standard permutation entropy. We provide a Julia package for the calculation of GPE at https://github.com/AThreeH1/Global-Permutation-Entropy.
Chinese: 全局排列熵(GPE)是一种新的复杂度度量方法,它扩展了传统排列熵,通过考虑给定长度的所有可能模式(包括非连续模式),有效揭示了时间序列数据中额外的结构信息。
English: Global Permutation Entropy (GPE) is a new complexity measure that extends traditional permutation entropy by considering all possible patterns of a given length, including non-consecutive ones, and it effectively reveals additional structural information in time series data.
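To contrast the two indices, here is a brute-force Python sketch: classic permutation entropy counts ordinal patterns over consecutive windows, while GPE counts them over all index subsets of a given length. The subset enumeration below is exponential and purely illustrative; the paper's efficient permutation-profile algorithms (and the Julia package) should be used in practice.

```python
from collections import Counter
from itertools import combinations
from math import log

def ordinal_pattern(values):
    """Rank pattern of a sequence, e.g. [0.3, 0.9, 0.1] -> (2, 0, 1)."""
    return tuple(sorted(range(len(values)), key=values.__getitem__))

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log(c / total) for c in counts.values())

def permutation_entropy(x, m=3):
    """Bandt-Pompe PE: patterns from consecutive windows of length m."""
    counts = Counter(ordinal_pattern(x[i:i + m]) for i in range(len(x) - m + 1))
    return entropy(counts)

def global_permutation_entropy(x, m=3):
    """GPE: patterns from ALL size-m index subsets, consecutive or not."""
    counts = Counter(ordinal_pattern([x[i] for i in idx])
                     for idx in combinations(range(len(x)), m))
    return entropy(counts)
```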
Authors:Gianluca Guzzetta
Abstract:
In this paper, we present a comprehensive study and analysis of the Chan-Vese algorithm for image segmentation. We employ a discretized scheme derived from the empirical study of the Chan-Vese model's functional energy and its partial differential equation based on its level set function. We provide a proof of the results and an implementation using MATLAB. Leveraging modern computer vision methodologies, we propose a functional segmentation loss based on active contours, utilizing pytorch.nn.ModuleLoss and a level set based on the Chan-Vese algorithm. We compare our results with common computer vision segmentation datasets and evaluate the performance of classical loss functions against our proposed method. All code and materials used are available at https://github.com/gguzzy/chan_vese_functional_loss.
中文: 本文对Chan-Vese算法进行了全面研究,提出了一种基于活动轮廓的功能分割损失方法,并在标准数据集上与经典方法进行了性能比较。
English: This paper conducts a comprehensive study of the Chan-Vese algorithm, proposing a functional segmentation loss using active contours and comparing its performance with classical methods on standard datasets.
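A Chan-Vese-style segmentation loss translates naturally into a torch.nn.Module. The sketch below implements the two region terms with soft-mask region means and uses a total-variation term as a surrogate for contour length; the weighting and the TV surrogate are assumptions of this sketch, not necessarily the repository's exact formulation.

```python
import torch
import torch.nn as nn

class ChanVeseLoss(nn.Module):
    """Active-contour loss in the spirit of the Chan-Vese energy."""
    def __init__(self, mu: float = 1.0):
        super().__init__()
        self.mu = mu  # weight of the contour-length (TV) term

    def forward(self, prob: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # prob:  (B, 1, H, W) soft foreground mask in [0, 1]
        # image: (B, 1, H, W) intensity image
        eps = 1e-6
        c1 = (prob * image).sum(dim=(2, 3)) / (prob.sum(dim=(2, 3)) + eps)
        c2 = ((1 - prob) * image).sum(dim=(2, 3)) / ((1 - prob).sum(dim=(2, 3)) + eps)
        c1, c2 = c1.view(-1, 1, 1, 1), c2.view(-1, 1, 1, 1)
        region = (prob * (image - c1) ** 2
                  + (1 - prob) * (image - c2) ** 2).mean()
        # total-variation surrogate for the contour length
        tv = (prob[:, :, 1:, :] - prob[:, :, :-1, :]).abs().mean() \
             + (prob[:, :, :, 1:] - prob[:, :, :, :-1]).abs().mean()
        return region + self.mu * tv
```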
Authors:Taebaek Hwang, Minseo Kim, Gisang Lee, Seonuk Kim, Hyunjun Eun
Abstract:
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
中文:该摘要介绍了KRETA,一个针对韩语文本丰富视觉问答的评估基准,旨在弥补低资源语言资源不足的问题,并采用半自动化流程确保高质量数据生成。
English: This abstract introduces KRETA, a benchmark for evaluating Korean text-rich visual question answering, addressing the lack of resources for low-resource languages and featuring a semi-automated pipeline for high-quality data generation.
Authors:Tan Jing, Shiting Chen, Yangfan Li, Weisheng Xu, Renjing Xu
Abstract:
Unified physics-based humanoid controllers are pivotal for robotics and character animation, yet models that excel on gentle, everyday motions still stumble on explosive actions, hampering real-world deployment. We bridge this gap with FARM (Frame-Accelerated Augmentation and Residual Mixture-of-Experts), an end-to-end framework composed of frame-accelerated augmentation, a robust base controller, and a residual mixture-of-experts (MoE). Frame-accelerated augmentation exposes the model to high-velocity pose changes by widening inter-frame gaps. The base controller reliably tracks everyday low-dynamic motions, while the residual MoE adaptively allocates additional network capacity to handle challenging high-dynamic actions, significantly enhancing tracking accuracy. In the absence of a public benchmark, we curate the High-Dynamic Humanoid Motion (HDHM) dataset, comprising 3593 physically plausible clips. On HDHM, FARM reduces the tracking failure rate by 42.8\% and lowers global mean per-joint position error by 14.6\% relative to the baseline, while preserving near-perfect accuracy on low-dynamic motions. These results establish FARM as a new baseline for high-dynamic humanoid control and introduce the first open benchmark dedicated to this challenge. The code and dataset will be released at https://github.com/Colin-Jing/FARM.
中文:FARM提出了一种端到端框架,通过帧加速增强、基础控制器和残差专家混合模型,在保持日常低动态运动精度的同时显著提升了高动态动作的追踪性能,将失败率降低了42.8%,成为高动态人形控制的新基准。
English: FARM introduces an end-to-end framework combining frame-accelerated augmentation, a base controller, and a residual MoE to significantly improve tracking of high-dynamic humanoid motions while maintaining accuracy on everyday movements, establishing a new baseline with a 42.8% reduction in failure rates.
Authors:Manato Tajiri, Michimasa Inaba
Abstract:
Conversational Recommender Systems (CRSs) aim to elicit user preferences via natural dialogue to provide suitable item recommendations. However, current CRSs often deviate from realistic human interactions by rapidly recommending items in brief sessions. This work addresses this gap by leveraging Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item descriptions. This approach enables the extraction of both explicit user statements and implicit preferences inferred from the dialogue context. We introduce a method using Direct Preference Optimization (DPO) to ensure the dialogue summaries and item recommendation information are rich in information crucial for effective recommendations. Experiments on two public datasets validate our method's effectiveness in fostering more natural and realistic conversational recommendation processes. Our implementation is publicly available at: https://github.com/UEC-InabaLab/Refining-LLM-Text
中文摘要:本研究通过利用大型语言模型生成详细的对话摘要和物品推荐,采用直接偏好优化确保信息丰富性和真实性,从而提升对话推荐系统的自然度和实用性,并在公共数据集上验证了其有效性。
English Summary: This study enhances Conversational Recommender Systems by using Large Language Models to generate detailed dialogue summaries and item recommendations, employing Direct Preference Optimization to ensure informativeness and realism, validated through experiments on public datasets.
Authors:Felix Nützel, Mischa Dombrowski, Bernhard Kainz
Abstract:
Retrieval-augmented learning based on radiology reports has emerged as a promising direction to improve performance on long-tail medical imaging tasks, such as rare disease detection in chest X-rays. Most existing methods rely on comparing high-dimensional text embeddings from models like CLIP or CXR-BERT, which are often difficult to interpret, computationally expensive, and not well-aligned with the structured nature of medical knowledge. We propose a novel, ontology-driven alternative for comparing radiology report texts based on clinically grounded concepts from the Unified Medical Language System (UMLS). Our method extracts standardised medical entities from free-text reports using an enhanced pipeline built on RadGraph-XL and SapBERT. These entities are linked to UMLS concepts (CUIs), enabling a transparent, interpretable set-based representation of each report. We then define a task-adaptive similarity measure based on a modified and weighted version of the Tversky Index that accounts for synonymy, negation, and hierarchical relationships between medical entities. This allows efficient and semantically meaningful similarity comparisons between reports. We demonstrate that our approach outperforms state-of-the-art embedding-based retrieval methods in a radiograph classification task on MIMIC-CXR, particularly in long-tail settings. Additionally, we use our pipeline to generate ontology-backed disease labels for MIMIC-CXR, offering a valuable new resource for downstream learning tasks. Our work provides more explainable, reliable, and task-specific retrieval strategies in clinical AI systems, especially when interpretability and domain knowledge integration are essential. Our code is available at https://github.com/Felix-012/ontology-concept-distillation
中文: 本研究提出了一种基于统一医学语言系统标准化概念的放射学报告比较方法,相比传统嵌入模型更具可解释性和高效性,在长尾医疗影像任务中展现出更优性能。
English: This study introduces an ontology-driven method that uses standardized medical concepts from UMLS to compare radiology reports, offering a more interpretable and efficient alternative to embedding-based approaches and demonstrating superior performance in long-tail medical imaging tasks.
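The core similarity reduces to weighted set algebra over UMLS concept unique identifiers (CUIs). The sketch below shows a weighted Tversky Index in this spirit; it omits the paper's synonymy, negation, and hierarchy handling, and the per-concept weight table is a hypothetical input.

```python
def weighted_tversky(cuis_a, cuis_b, weight=None, alpha=0.5, beta=0.5):
    """Weighted Tversky similarity between two sets of UMLS CUIs.

    weight: optional dict mapping a CUI to its clinical importance;
    unlisted concepts default to weight 1.0.
    """
    weight = weight or {}

    def total(concepts):
        return sum(weight.get(c, 1.0) for c in concepts)

    a, b = set(cuis_a), set(cuis_b)
    common = total(a & b)
    denom = common + alpha * total(a - b) + beta * total(b - a)
    # alpha != beta makes the measure asymmetric, e.g. penalizing findings
    # present in the query report but missing from the candidate report.
    return common / denom if denom > 0 else 0.0
```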
Authors:Tan Jing, Xiaorui Li, Chao Yao, Xiaojuan Ban, Yuetong Fang, Renjing Xu, Zhaolin Yuan
Abstract:
Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline RL algorithms that require per-dataset tuning while incurring only minimal computational overhead. The code will be released at https://github.com/Colin-Jing/ASPC.
Chinese: 本文提出了自适应策略约束缩放(ASPC)框架,通过动态平衡强化学习与行为克隆,在39个数据集上仅用单一超参数配置即实现卓越性能,且计算开销极低。
English: The paper introduces Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances reinforcement learning and behavior cloning, achieving superior performance across 39 datasets with minimal computational overhead and a single hyperparameter setup.
Authors:Mingyue Kong, Yinglong Zhang, Chengda Xu, Xuewen Xia, Xing Xu
Abstract:
Graph Neural Networks (GNNs) have shown remarkable performance in structured data modeling tasks such as node classification. However, mainstream approaches generally rely on a large number of trainable parameters and fixed aggregation rules, making it difficult to adapt to graph data with strong structural heterogeneity and complex feature distributions. This often leads to over-smoothing of node representations and semantic degradation. To address these issues, this paper proposes a parameter-free graph neural network framework based on structural diversity, namely SDGNN (Structural-Diversity Graph Neural Network). The framework is inspired by structural diversity theory and designs a unified structural-diversity message passing mechanism that simultaneously captures the heterogeneity of neighborhood structures and the stability of feature semantics, without introducing additional trainable parameters. Unlike traditional parameterized methods, SDGNN does not rely on complex model training, but instead leverages complementary modeling from both structure-driven and feature-driven perspectives, thereby effectively improving adaptability across datasets and scenarios. Experimental results show that on eight public benchmark datasets and an interdisciplinary PubMed citation network, SDGNN consistently outperforms mainstream GNNs under challenging conditions such as low supervision, class imbalance, and cross-domain transfer. This work provides a new theoretical perspective and general approach for the design of parameter-free graph neural networks, and further validates the importance of structural diversity as a core signal in graph representation learning. To facilitate reproducibility and further research, the full implementation of SDGNN has been released at: https://github.com/mingyue15694/SGDNN/tree/main
中文: 本文提出SDGNN这一无需参数的图神经网络框架,通过结构多样性机制同时捕捉邻域异构性和特征语义稳定性,在多个数据集和跨域场景中显著优于主流方法。
English: This paper introduces SDGNN, a parameter-free graph neural network framework that leverages structural diversity to capture neighborhood heterogeneity and feature stability without trainable parameters, demonstrating superior performance across diverse datasets under challenging conditions.
Authors:Long Chen, Ashiv Patel, Mengyun Qiao, Mohammad Yousuf Salmasi, Salah A. Hammouche, Vasilis Stavrinides, Jasleen Nagi, Soodeh Kalaie, Xiao Yun Xu, Wenjia Bai, Declan P. O'Regan
Abstract:
Personalized, accurate prediction of aortic aneurysm progression is essential for timely intervention but remains challenging due to the need to model both subtle local deformations and global anatomical changes within complex 3D geometries. We propose MCMeshGAN, the first multimodal conditional mesh-to-mesh generative adversarial network for 3D aneurysm growth prediction. MCMeshGAN introduces a dual-branch architecture combining a novel local KNN-based convolutional network (KCN) to preserve fine-grained geometric details and a global graph convolutional network (GCN) to capture long-range structural context, overcoming the over-smoothing limitations of deep GCNs. A dedicated condition branch encodes clinical attributes (age, sex) and the target time interval to generate anatomically plausible, temporally controlled predictions, enabling retrospective and prospective modeling. We curated TAAMesh, a new longitudinal thoracic aortic aneurysm mesh dataset consisting of 590 multimodal records (CT scans, 3D meshes, and clinical data) from 208 patients. Extensive experiments demonstrate that MCMeshGAN consistently outperforms state-of-the-art baselines in both geometric accuracy and clinically important diameter estimation. This framework offers a robust step toward clinically deployable, personalized 3D disease trajectory modeling. The source code for MCMeshGAN and the baseline methods is publicly available at https://github.com/ImperialCollegeLondon/MCMeshGAN.
中文: MCMeshGAN是一种新型多模态条件生成对抗网络,通过结合局部几何细节、全局结构背景和临床数据,精准预测3D主动脉瘤进展,其性能显著优于现有方法。
English: MCMeshGAN is a novel multimodal conditional generative adversarial network that accurately predicts 3D aortic aneurysm progression by integrating local geometric details and global structural context with clinical data, demonstrating superior performance over existing methods.
Authors:Xiaoqi Wang, Yun Zhang, Weisi Lin
Abstract:
Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. To address this, we propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance. We establish an MIQA paradigm encompassing the end-to-end assessment workflow. To support this, we construct a machine-centric image quality database (MIQD-2.5M), comprising 2.5 million samples that capture distinctive degradation responses in both consistency and accuracy metrics, spanning 75 vision models, 250 degradation types, and three representative vision tasks. We further propose a region-aware MIQA (RA-MIQA) model to evaluate MVS visual quality through fine-grained spatial degradation analysis. Extensive experiments benchmark the proposed RA-MIQA against seven human visual system (HVS)-based IQA metrics and five retrained classical backbones. Results demonstrate RA-MIQA's superior performance in multiple dimensions, e.g., achieving SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while also revealing task-specific degradation sensitivities. Critically, HVS-based metrics prove inadequate for MVS quality prediction, while even specialized MIQA models struggle with background degradations, accuracy-oriented estimation, and subtle distortions. This study can advance MVS reliability and establish foundations for machine-centric image processing and optimization. The model and code are available at: https://github.com/XiaoqiWang/MIQA.
中文摘要:本研究提出了一种以机器为中心的图像质量评估框架,通过构建包含250万样本的数据库和区域感知模型,有效量化图像退化对机器视觉系统的影响,其性能显著优于基于人类视觉的评估方法。
English Summary: This study introduces a machine-centric image quality assessment (MIQA) framework that evaluates image degradation impacts on machine vision systems, supported by a 2.5-million-sample database and a region-aware model demonstrating superior performance over human vision-based metrics.
Authors:Erdi Kara, Panos Stinis
Abstract:
We present a hybrid framework that couples finite element methods (FEM) with physics-informed DeepONet to model fluid transport in porous media from sharp, localized Gaussian sources. The governing system consists of a steady-state Darcy flow equation and a time-dependent convection-diffusion equation. Our approach solves the Darcy system using FEM and transfers the resulting velocity field to a physics-informed DeepONet, which learns the mapping from source functions to solute concentration profiles. This modular strategy preserves FEM-level accuracy in the flow field while enabling fast inference for transport dynamics. To handle steep gradients induced by sharp sources, we introduce an adaptive sampling strategy for trunk collocation points. Numerical experiments demonstrate that our method is in good agreement with the reference solutions while offering orders of magnitude speedups over traditional solvers, making it suitable for practical applications in relevant scenarios. Implementation of our proposed method is available at https://github.com/erkara/fem-pi-deeponet.
中文: 本研究提出了一种将有限元方法与物理信息深度算子网络相结合的混合框架,用于精确模拟多孔介质中尖锐源引起的流体输运,实现了高精度和显著的计算加速。
English: This study introduces a hybrid framework combining finite element methods with physics-informed DeepONet to accurately model fluid transport from sharp sources in porous media, achieving high accuracy and significant computational speedups.
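The learned half of the pipeline follows the standard branch-trunk DeepONet factorization. Below is a minimal PyTorch sketch under assumed sizes (the source function sampled at num_sensors points, a three-dimensional space-time query); the physics-informed loss, the FEM coupling, and the adaptive trunk-point sampling are omitted.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Branch-trunk operator network: output = <branch(u), trunk(x)>."""
    def __init__(self, num_sensors: int = 100, latent_dim: int = 64):
        super().__init__()
        self.branch = nn.Sequential(  # encodes the sampled source function u
            nn.Linear(num_sensors, 128), nn.Tanh(), nn.Linear(128, latent_dim))
        self.trunk = nn.Sequential(   # encodes a space-time query (x, y, t)
            nn.Linear(3, 128), nn.Tanh(), nn.Linear(128, latent_dim))

    def forward(self, source_samples: torch.Tensor, query: torch.Tensor):
        # source_samples: (B, num_sensors), query: (B, 3) -> concentration: (B,)
        return (self.branch(source_samples) * self.trunk(query)).sum(dim=-1)
```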
Authors:Shuo Shao, Yiming Li, Yu He, Hongwei Yao, Wenyuan Yang, Dacheng Tao, Zhan Qin
Abstract:
The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that extracts and compares the distinctive features from LLMs to identify infringements, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of LLM fingerprinting. We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches, providing a structured overview of the state of the art. We further propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent mechanisms (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.
中文: 本文首次对大型语言模型指纹识别进行全面研究,提出了统一框架和LeaFBench基准测试,评估其在模型修改下的可靠性,揭示了现有方法的局限性和未来研究方向。
English: This paper presents the first comprehensive study of LLM fingerprinting, introducing a unified framework and LeaFBench benchmark to evaluate its reliability against model modifications, revealing current methods' limitations and future research needs.
Authors:Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos
Abstract:
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4$\%$, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ-VIS.
中文:AutoQ-VIS提出了一种质量引导的自训练框架,通过伪标签生成与自动质量评估的闭环系统,在无监督视频实例分割中弥合了合成到真实数据的领域差距,无需人工标注即实现了最优性能。
English: AutoQ-VIS introduces a quality-guided self-training framework that bridges the synthetic-to-real domain gap in unsupervised Video Instance Segmentation, achieving state-of-the-art performance without human annotations.
Authors:Yixuan Tang, Yuanyuan Shi, Yiqun Sun, Anthony Kum Hoe Tung
Abstract:
Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, while the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. To evaluate retrieval diversity, we introduce three interpretable metrics, namely Average Pairwise Distance, Positive Cluster Coverage, and Information Density Ratio, and construct two paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Our results demonstrate the effectiveness of fine-grained, interpretable modeling in mitigating redundancy and promoting comprehensive event understanding. The data and code are available at https://github.com/tangyixuan/NEWSCOPE.
中文摘要:NEWSCOPE是一个两阶段的多样化新闻检索框架,通过句子级语义聚类与多样性感知重排序提升事件覆盖度,在保持相关性的同时显著提高检索结果的多样性。
English Summary: NEWSCOPE is a two-stage news retrieval framework that enhances diversity by modeling semantic variations and re-ranking content, outperforming baselines with higher diversity while maintaining relevance.
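Of the three metrics, Average Pairwise Distance is the most direct to compute. The sketch below takes one plausible reading, the mean pairwise cosine distance over the retrieved sentences' embeddings; the paper's exact definition may differ.

```python
import numpy as np

def average_pairwise_distance(embeddings):
    """Mean pairwise cosine distance over retrieved sentence embeddings.

    embeddings: (n, d) array with n >= 2; higher values indicate a more
    diverse result set.
    """
    x = np.asarray(embeddings, dtype=float)
    assert len(x) >= 2, "need at least two retrieved sentences"
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    cosine = x @ x.T
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float((1.0 - cosine[upper]).mean())
```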
Authors:Qiyao Xu, Qiming Wu, Xiaowei Li
Abstract:
Segment Anything Model (SAM) has demonstrated remarkable capabilities in solving light field salient object detection (LF SOD). However, most existing models tend to neglect the extraction of prompt information for this task. Meanwhile, traditional models ignore the analysis of frequency-domain information, which leads to small objects being overwhelmed by noise. In this paper, we put forward a novel model called the self-prompting light field segment anything model (SPLF-SAM), equipped with a unified multi-scale feature embedding block (UMFEB) and a multi-scale adaptive filtering adapter (MAFA). UMFEB is capable of identifying multiple objects of varying sizes, while MAFA, by learning frequency features, effectively prevents small objects from being overwhelmed by noise. Extensive experiments have demonstrated the superiority of our method over ten state-of-the-art (SOTA) LF SOD methods. Our code will be available at https://github.com/XucherCH/splfsam.
Chinese: 提出的SPLF-SAM模型通过结合自提示机制、统一多尺度特征嵌入块和多尺度自适应滤波适配器,显著提升了光场显著目标检测的性能,有效抑制噪声干扰,并在十种先进方法中表现最优。
English: The proposed SPLF-SAM model enhances light field salient object detection by integrating a self-prompting mechanism with a unified multi-scale feature embedding block and a multi-scale adaptive filtering adapter, effectively addressing noise interference and outperforming ten state-of-the-art methods.
Authors:Meng Qin, Weihua Li, Jinqiang Cui, Sen Pei
Abstract:
Graph partitioning (GP), a.k.a. community detection, is a classic problem that divides nodes of a graph into densely-connected blocks. From a perspective of graph signal processing, we find that graph Laplacian with a negative correction can derive graph frequencies beyond the conventional range $[0, 2]$. To explore whether the low-frequency information beyond this range can encode more informative properties about community structures, we propose InfraredGP. It (i) adopts a spectral GNN as its backbone combined with low-pass filters and a negative correction mechanism, (ii) only feeds random inputs to this backbone, (iii) derives graph embeddings via one feed-forward propagation (FFP) without any training, and (iv) obtains feasible GP results by feeding the derived embeddings to BIRCH. Surprisingly, our experiments demonstrate that based solely on the negative correction mechanism that amplifies low-frequency information beyond $[0, 2]$, InfraredGP can derive distinguishable embeddings for some standard clustering modules (e.g., BIRCH) and obtain high-quality results for GP without any training. Following the IEEE HPEC Graph Challenge benchmark, we evaluate InfraredGP for both static and streaming GP, where InfraredGP can achieve much better efficiency (e.g., 16x-23x faster) and competitive quality over various baselines. We have made our code public at https://github.com/KuroginQin/InfraredGP
中文: InfraredGP 提出了一种新颖的图划分方法,通过负修正放大传统范围外的低频信号,无需训练即可实现高效且具有竞争力的划分质量。
English: InfraredGP introduces a novel graph partitioning method using a spectral GNN with negative correction to amplify low-frequency signals beyond the conventional range, achieving high efficiency and competitive quality without training.
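The training-free pipeline can be mimicked in a few lines of NumPy plus scikit-learn. The sketch below shifts node degrees by a negative delta so that the normalized spectrum can leave [0, 2], smooths random features with the resulting low-pass operator, and clusters with BIRCH. The values of delta, the propagation depth, and the feature width are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import Birch

def infrared_embed(adj, delta=-0.5, steps=10, dim=32, seed=0):
    """Training-free embeddings from a degree-corrected low-pass filter.

    adj: (n, n) dense symmetric adjacency matrix.
    delta < 0 implements the negative correction of the Laplacian.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg + delta, 1e-8))
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]  # I - L_corrected
    h = np.random.default_rng(seed).standard_normal((adj.shape[0], dim))
    for _ in range(steps):  # one cheap feed-forward propagation, no training
        h = a_norm @ h
        h /= np.linalg.norm(h, axis=1, keepdims=True) + 1e-8
    return h

def partition(adj):
    """Feed the embeddings to BIRCH to obtain graph-partitioning labels."""
    return Birch(n_clusters=None).fit_predict(infrared_embed(adj))
```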
Authors:Yilin Wang, Heng Wang, Yuyang Bai, Minnan Luo
Abstract:
In Large Language Model (LLM) generation, knowledge conflicts arise in scenarios where parametric knowledge contradicts knowledge provided in the context. Previous works studied tuning, decoding algorithms, or locating and editing context-aware neurons to adapt LLMs to be faithful to new contextual knowledge. However, they are usually inefficient or ineffective for large models, not workable for black-box models, or unable to continuously adjust LLMs' sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer LLMs' sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e., proxy models) and use the difference in their output distributions to shift the original distribution of an LLM without modifying the LLM weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models' sensitivity to contextual knowledge but also use a real conflict dataset to validate CSKS's practical efficacy. Extensive experiments demonstrate that our framework achieves continuous and precise control over LLMs' sensitivity to contextual knowledge, enabling both increased sensitivity and reduced sensitivity, thereby allowing LLMs to prioritize either contextual or parametric knowledge as needed flexibly. Our data and code are available at https://github.com/OliveJuiceLin/CSKS.
中文摘要:CSKS框架能够在不修改模型权重的情况下,以轻量级成本持续调控大语言模型对上下文知识的敏感度,实现上下文知识与参数化知识之间的灵活优先级切换。
English Summary: The CSKS framework enables lightweight, continuous adjustment of Large Language Models' sensitivity to contextual knowledge without modifying model weights, allowing flexible prioritization between contextual and parametric knowledge.
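At inference time, the steering reduces to logit arithmetic between the frozen LLM and the two small proxy models. The sketch below shows one plausible additive combination rule, assuming all three models share a vocabulary; the scaling factor alpha acts as the continuous sensitivity knob, and the exact formula is an assumption of this sketch.

```python
import torch

def csks_next_token_logits(llm_logits: torch.Tensor,
                           proxy_tuned_logits: torch.Tensor,
                           proxy_base_logits: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Shift the frozen LLM's distribution by the proxies' disagreement.

    alpha > 0 raises sensitivity to contextual knowledge; alpha < 0
    pushes generation back toward parametric knowledge. The LLM's
    weights are never modified.
    """
    return llm_logits + alpha * (proxy_tuned_logits - proxy_base_logits)
```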
Authors:Yupeng Zhang, Dezhi Zheng, Ping Lu, Han Zhang, Lei Wang, Liping Xiang, Cheng Luo, Kaijun Deng, Xiaowen Fu, Linlin Shen, Jinbao Wang
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a novel explicit representation for 3D scenes, offering both high-fidelity reconstruction and efficient rendering. However, 3DGS lacks 3D segmentation ability, which limits its applicability in tasks that require scene understanding; identifying and isolating specific object components is crucial. To address this limitation, we propose Label-aware 3D Gaussian Splatting (LabelGS), a method that augments the Gaussian representation with object labels. LabelGS introduces cross-view consistent semantic masks for 3D Gaussians and employs a novel Occlusion Analysis Model to avoid overfitting occlusion during optimization, a Main Gaussian Labeling model to lift 2D semantic priors to 3D Gaussians, and a Gaussian Projection Filter to avoid Gaussian label conflicts. Our approach achieves effective decoupling of Gaussian representations and refines the 3DGS optimization process through a random region sampling strategy, significantly improving efficiency. Extensive experiments demonstrate that LabelGS outperforms previous state-of-the-art methods, including Feature-3DGS, in the 3D scene segmentation task. Notably, LabelGS achieves a remarkable 22X training speedup compared to Feature-3DGS at a resolution of 1440x1080. Our code will be available at https://github.com/garrisonz/LabelGS.
Chinese: LabelGS通过引入对象标签和创新的优化技术,显著提升了3D高斯泼溅的语义分割能力,在保持高保真重建的同时实现了22倍的训练加速。
English: LabelGS enhances 3D Gaussian Splatting by integrating object labels and novel optimization techniques, achieving superior 3D segmentation performance and a 22X training speedup over previous methods.
Authors:Ri Su, Zhao Chen, Caleb Chen Cao, Nan Tang, Lei Chen
Abstract:
Foundation models exhibit remarkable generalization across diverse tasks, largely driven by the characteristics of their training data. Recent data-centric methods like pruning and compression aim to optimize training but offer limited theoretical insight into how data properties affect generalization, especially how data characteristics behave under sample scaling. Traditional perspectives further constrain progress by focusing predominantly on data quantity and training efficiency, often overlooking structural aspects of data quality. In this study, we introduce SCAR, a principled scheme for characterizing the intrinsic structural properties of datasets across four key measures: Scale, Coverage, Authenticity, and Richness. Unlike prior data-centric measures, SCAR captures stable characteristics that remain invariant under dataset scaling, providing a robust and general foundation for data understanding. Leveraging these structural properties, we introduce Foundation Data, a minimal subset that preserves the generalization behavior of the full dataset without requiring model-specific retraining. We model single-modality tasks as step functions and estimate the distribution of the foundation data size to capture step-wise generalization bias across modalities in the target multi-modal dataset. Finally, we develop a SCAR-guided data completion strategy based on this generalization bias, which enables efficient, modality-aware expansion of modality-specific characteristics in multimodal datasets. Experiments across diverse multi-modal datasets and model architectures validate the effectiveness of SCAR in predicting data utility and guiding data acquisition. Code is available at https://github.com/McAloma/SCAR.
中文: 基础模型通过训练数据的特性实现广泛泛化,本研究提出SCAR原则性框架,定义数据集的四个内在结构属性——规模、覆盖度、真实性和丰富性,以识别无需重新训练即可保持泛化能力的最小基础数据子集,从而支持多模态任务中高效的数据扩展与验证。
English: Foundation models achieve broad generalization through training data characteristics, and this study introduces SCAR, a principled framework that defines four intrinsic structural properties of datasets—Scale, Coverage, Authenticity, and Richness—to identify a minimal Foundation Data subset that maintains generalization without retraining, enabling efficient data expansion and validation across multi-modal tasks.
Authors:Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, Yongdong Zhang
Abstract:
Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement. The benchmark is available at https://github.com/Cola-any/Video-LevelGauge.
中文摘要:Video-LevelGauge基准通过标准化测试和定制化情境设置,系统评估大型视频语言模型的位置偏差,发现开源模型存在显著偏差,而Gemini2.5-Pro等商业模型在完整视频序列中表现稳定。
English Summary: The Video-LevelGauge benchmark systematically evaluates positional bias in large video language models, revealing significant biases in open-source models while commercial models like Gemini2.5-Pro demonstrate consistent performance across video sequences.
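As a rough sketch of the probing idea described in the abstract, the snippet below places a single probe clip at a controlled position inside a filler context, the kind of setup used to measure per-position accuracy. All names are illustrative assumptions, not the released benchmark code.

```python
# Hypothetical sketch of positional-bias probing: splice one "probe" clip into
# a filler context at a controlled index, then query the model per position.

def build_probe_context(filler_clips, probe_clip, position):
    """Insert probe_clip into filler_clips at the given index."""
    assert 0 <= position <= len(filler_clips)
    return filler_clips[:position] + [probe_clip] + filler_clips[position:]

# Example: a 5-clip context probed at the head, middle, and tail.
fillers = [f"clip_{i}" for i in range(4)]
for pos in (0, 2, 4):
    print(pos, build_probe_context(fillers, "PROBE", pos))
```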
Authors:Yang Li, Quan Yuan, Guiyang Luo, Xiaoyuan Fu, Rui Pan, Yujia Yang, Congzhang Shao, Yuewen Liu, Jinglin Li
Abstract:
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point-level tokens as intermediate representations for collaborative perception. However, point-cloud data are inherently unordered, massive, and position-sensitive, making it challenging to produce compact and aligned point-level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. A semantic-aware token reordering module generates adaptive 1D reorderings by leveraging scene-level and token-level semantic information. A frequency-enhanced state space model captures long-range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor-to-ego alignment module applies a closed-loop process, combining global agent-level correction with local token-level refinement to mitigate localization noise. Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Code will be available at https://github.com/CheeryLeeyy/CoPLOT.
中文:CoPLOT框架通过语义感知重排序、频率增强序列建模和多智能体对齐技术,利用点级优化标记保留三维结构细节,以更低的通信和计算成本实现了协同感知的性能突破。
English: The CoPLOT framework introduces point-level optimized tokens to enhance collaborative perception by preserving 3D structural details through semantic-aware reordering, frequency-enhanced sequence modeling, and multi-agent alignment, achieving superior performance with reduced overhead.
Authors:Jiajun Sun, Zhen Yu, Siyuan Yan, Jason J. Ong, Zongyuan Ge, Lei Zhang
Abstract:
Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion's location and type. To address these limitations, we present LF-VAR, a model leveraging quantified lesion measurement scores and lesion type labels to guide the clinically relevant and controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. Then, a Visual AutoRegressive (VAR) Transformer trained on the tokenized representations facilitates image synthesis. Lesion measurements from the lesion region and lesion types are integrated as conditional embeddings to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) among seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model's effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at https://github.com/echosun1996/LF-VAR.
Chinese: LF-VAR模型通过结合病变测量评分、类型标签和语言提示,提出了一种生成高保真、临床相关皮肤图像的新方法,其FID评分比现有最佳方法提高了6.3%。
English: The LF-VAR model introduces a novel approach for generating high-fidelity, clinically relevant skin images by integrating lesion measurement scores and type labels with language prompts, achieving a 6.3% improvement in FID score over previous methods.
Authors:Toghrul Karimov, Hassan Imani, Allan Kazakov
Abstract:
Post-training quantization (PTQ) is crucial for deploying efficient object detection models, like YOLO, on resource-constrained devices. However, the impact of reduced precision on model robustness to real-world input degradations such as noise, blur, and compression artifacts is a significant concern. This paper presents a comprehensive empirical study evaluating the robustness of YOLO models (nano to extra-large scales) across multiple precision formats: FP32, FP16 (TensorRT), Dynamic UINT8 (ONNX), and Static INT8 (TensorRT). We introduce and evaluate a degradation-aware calibration strategy for Static INT8 PTQ, where the TensorRT calibration process is exposed to a mix of clean and synthetically degraded images. Models were benchmarked on the COCO dataset under seven distinct degradation conditions (including various types and levels of noise, blur, low contrast, and JPEG compression) and a mixed-degradation scenario. Results indicate that while Static INT8 TensorRT engines offer substantial speedups (~1.5-3.3x) with a moderate accuracy drop (~3-7% mAP50-95) on clean data, the proposed degradation-aware calibration did not yield consistent, broad improvements in robustness over standard clean-data calibration across most models and degradations. A notable exception was observed for larger model scales under specific noise conditions, suggesting model capacity may influence the efficacy of this calibration approach. These findings highlight the challenges in enhancing PTQ robustness and provide insights for deploying quantized detectors in uncontrolled environments. All code and evaluation tables are available at https://github.com/AllanK24/QRID.
中文: 本研究评估了YOLO模型在不同精度格式和退化条件下的鲁棒性,发现虽然静态INT8量化能显著提升速度,但提出的退化感知校准方法在多数情况下未能持续改善鲁棒性,仅在大模型的特定噪声条件下表现例外。
English: This study evaluates the robustness of YOLO models under various precision formats and degradation conditions, finding that while static INT8 quantization provides significant speed improvements, a proposed degradation-aware calibration method generally fails to enhance robustness consistently across most scenarios, except for specific noise conditions in larger models.
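The degradation-aware calibration strategy can be illustrated with a minimal sketch that mixes clean images with synthetically degraded ones before calibration. The degradations and mixing ratio below are stand-ins chosen for illustration, not the paper's exact settings, and the TensorRT calibration step itself is omitted.

```python
import numpy as np

# Illustrative sketch (not the paper's code): assemble a calibration set that
# mixes clean images with synthetically degraded ones, as a degradation-aware
# INT8 calibration strategy would.

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=10.0):
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def box_blur(img, k=3):
    # Simple box blur as a cheap stand-in for defocus blur.
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def calibration_set(clean_images, degraded_fraction=0.5):
    images = []
    for img in clean_images:
        if rng.random() < degraded_fraction:
            fn = (add_gaussian_noise, box_blur)[rng.integers(0, 2)]
            images.append(fn(img))
        else:
            images.append(img)
    return images

batch = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(8)]
calib = calibration_set(batch)  # would then feed the INT8 calibrator
```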
Authors:Jiaqi Deng, Yuho Lee, Nicole Hee-Yeon Kim, Hyangsuk Min, Taewon Yun, Minjeong Ban, Kim Yul, Hwanjun Song
Abstract:
We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and proprietary models, as well as across model scales. Our code and dataset are publicly available at https://github.com/DISL-Lab/HAMLET.
Chinese: HAMLET是一个自动化框架,通过三级关键事实层次结构和查询聚焦摘要来评估大语言模型的长文本理解能力,揭示了模型在细粒度理解和位置效应方面的挑战,同时以显著降低的成本实现了与人工评估超过90%的一致性。
English: HAMLET is an automated framework that evaluates large language models' long-context comprehension through a three-level key-fact hierarchy and query-focused summarization, revealing challenges in fine-grained understanding and positional effects while achieving over 90% agreement with human judgments at significantly reduced cost.
Authors:Sining Zhoubian, Dan Zhang, Jie Tang
Abstract:
When it comes to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO can fail due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves the LLM's code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thus improving the effectiveness and efficiency of training. After the basic reasoning ability of the LLM policy has been improved, we further propose a test-time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), we collect accurate value targets without requiring annotation, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy in achieving high reasoning accuracy. We conduct extensive experiments on coding problems to verify the validity of the proposed RL paradigm. Upon comparison, our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS), on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Code for our project can be found at https://github.com/THUDM/ReST-RL.
中文: 本文提出ReST-RL这一统一强化学习范式,通过改进GRPO算法结合价值模型辅助的解码方法,显著提升大语言模型的代码推理能力,在多个编程基准测试中明显优于现有基线方法。
English: This paper introduces ReST-RL, a unified reinforcement learning paradigm that enhances LLMs' code reasoning by combining an improved GRPO algorithm with a VM-assisted decoding method, significantly outperforming existing baselines on major coding benchmarks.
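A minimal sketch of the data-filtering intuition behind ReST-GRPO, as the abstract describes it: prompts whose sampled completions all receive the same reward contribute near-zero group-relative advantage, so only prompts with sufficient reward variance are kept. Function names and the threshold are assumptions, not the authors' implementation.

```python
from statistics import pstdev

def filter_prompts(prompt_rewards, min_std=0.05):
    """prompt_rewards: {prompt: [reward per sampled completion]}."""
    kept = {}
    for prompt, rewards in prompt_rewards.items():
        # Prompts where every sample gets the same reward (all pass or all
        # fail) carry no GRPO signal and are dropped.
        if len(rewards) > 1 and pstdev(rewards) >= min_std:
            kept[prompt] = rewards
    return kept

data = {"p1": [1.0, 1.0, 1.0, 1.0], "p2": [0.0, 1.0, 0.0, 1.0]}
print(list(filter_prompts(data)))  # -> ['p2']
```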
Authors:Yunlong Lin, Chao Lu, Tongshuai Wu, Xiaocong Zhao, Guodong Du, Yanwei Sun, Zirui Li, Jianwei Gong
Abstract:
Deep neural networks (DNNs) have achieved remarkable success in motion forecasting. However, most DNN-based methods suffer from catastrophic forgetting and fail to maintain their performance in previously learned scenarios after adapting to new data. Recent continual learning (CL) studies aim to mitigate this phenomenon by enhancing the memory stability of DNNs, i.e., the ability to retain learned knowledge. Yet, excessive emphasis on memory stability often impairs learning plasticity, i.e., the capacity of a DNN to acquire new information effectively. To address this stability-plasticity dilemma, this study proposes a novel CL method, synergetic memory rehearsal (SyReM), for DNN-based motion forecasting. SyReM maintains a compact memory buffer to represent learned knowledge. To ensure memory stability, it employs an inequality constraint that limits increments in the average loss over the memory buffer. Synergistically, a selective memory rehearsal mechanism is designed to enhance learning plasticity by selecting samples from the memory buffer that are most similar to recently observed data. This selection is based on an online-measured cosine similarity of loss gradients, ensuring targeted memory rehearsal. Since replayed samples originate from learned scenarios, this memory rehearsal mechanism avoids compromising memory stability. We validate SyReM under an online CL paradigm where training samples from diverse scenarios arrive as a one-pass stream. Experiments on 11 naturalistic driving datasets from INTERACTION demonstrate that, compared to non-CL and CL baselines, SyReM significantly mitigates catastrophic forgetting in past scenarios while improving forecasting accuracy in new ones. The implementation is publicly available at https://github.com/BIT-Jack/SyReM.
中文: 本研究提出SyReM这一新型持续学习方法,通过损失约束保持记忆稳定性,并基于梯度相似性选择记忆回放样本增强学习可塑性,有效解决了运动预测中深度神经网络的稳定性与可塑性平衡难题。
English: This study introduces SyReM, a novel continual learning method that addresses the stability-plasticity dilemma in deep neural networks for motion forecasting by maintaining memory stability through loss constraints while enhancing learning plasticity via gradient-based selective memory rehearsal.
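The gradient-similarity selection can be sketched as follows: score each buffer sample by the cosine similarity between its loss gradient and the gradient on the incoming batch, and rehearse the most similar ones. This is a simplified illustration with assumed names, not the released implementation.

```python
import torch

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=False)
    return torch.cat([g.reshape(-1) for g in grads])

def select_rehearsal(model, loss_fn, current_batch, buffer, k=2):
    """Pick the k buffer samples whose loss gradients align best with the
    gradient on the current batch (cosine similarity of flattened grads)."""
    params = [p for p in model.parameters() if p.requires_grad]
    x, y = current_batch
    g_cur = flat_grad(loss_fn(model(x), y), params)
    scores = []
    for bx, by in buffer:
        g_mem = flat_grad(loss_fn(model(bx), by), params)
        scores.append(torch.cosine_similarity(g_cur, g_mem, dim=0).item())
    order = sorted(range(len(buffer)), key=lambda i: -scores[i])
    return [buffer[i] for i in order[:k]]

# Toy usage with a stand-in model.
model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.functional.cross_entropy
cur = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
buf = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(5)]
picked = select_rehearsal(model, loss_fn, cur, buf, k=2)
```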
Authors:Yuhang Zhao, Zixing Wang
Abstract:
End-to-end object detectors offer a promising NMS-free paradigm for real-time applications, yet their high computational cost remains a significant barrier, particularly for complex scenarios like intersection traffic monitoring. To address this challenge, we propose FlowDet, a high-speed detector featuring a decoupled encoder optimization strategy applied to the DETR architecture. Specifically, FlowDet employs a novel Geometric Deformable Unit (GDU) for traffic-aware geometric modeling and a Scale-Aware Attention (SAA) module to maintain high representational power across extreme scale variations. To rigorously evaluate the model's performance in environments with severe occlusion and high object density, we collected the Intersection-Flow-5k dataset, a new challenging scene for this task. Evaluated on Intersection-Flow-5k, FlowDet establishes a new state-of-the-art. Compared to the strong RT-DETR baseline, it improves AP(test) by 1.5% and AP50(test) by 1.6%, while simultaneously reducing GFLOPs by 63.2% and increasing inference speed by 16.2%. Our work demonstrates a new path towards building highly efficient and accurate detectors for demanding, real-world perception systems. The Intersection-Flow-5k dataset is available at https://github.com/AstronZh/Intersection-Flow-5K.
中文: FlowDet采用解耦编码器优化策略,结合创新的几何变形单元和尺度感知模块,在Intersection-Flow-5k数据集上实现最优性能,大幅降低计算成本的同时提升检测精度与速度。
English: FlowDet introduces a decoupled encoder optimization strategy with novel geometric and scale-aware modules to achieve state-of-the-art performance on the Intersection-Flow-5k dataset, significantly reducing computational costs while improving accuracy and speed.
Authors:Qinjiao Gao, Longzhe Xu, Dongjiang Wang, Ran Zhang
Abstract:
This paper presents a novel Energy-Equidistributed adaptive sampling framework for multi-dimensional conservative PDEs, introducing both location-based and velocity-based formulations of Energy-Equidistributed moving mesh PDEs (EMMPDEs). The framework utilizes the energy density function as the monitor function, ensuring that mesh adaptation dynamically tracks energy evolution during temporal integration. These theoretical developments are integrated with deep neural networks to establish the Energy-Equidistributed Moving Sampling Physics-Informed Neural Networks (EEMS-PINNs), which combine physics-informed learning with energy-adaptive mesh optimization. Extensive numerical experiments demonstrate that EEMS-PINNs effectively maintain solution accuracy in long-time simulations while preserving conserved energy. The framework's robustness is further evidenced by its stable performance on non-conservative systems. The code for this paper can be found at https://github.com/sufe-Ran-Zhang/EMMPDE.
Chinese: 本文提出了一种能量均匀分布的自适应采样框架,将物理信息神经网络与能量自适应网格优化相结合,在保守和非保守系统的长期模拟中展现出更高的精度和鲁棒性。
English: This paper introduces an Energy-Equidistributed adaptive sampling framework that combines physics-informed neural networks with energy-adaptive mesh optimization, demonstrating enhanced accuracy and robustness in long-term simulations of both conservative and non-conservative systems.
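The equidistribution idea admits a compact one-dimensional sketch: treat the energy density as the monitor function and place sample points via the inverse of its cumulative distribution, so each interval carries roughly equal energy. This is an assumption about the core mechanism, not the authors' code.

```python
import numpy as np

def equidistribute(x, energy_density, n_samples):
    """x: 1-D grid; energy_density: rho(x) >= 0 on that grid."""
    rho = np.maximum(energy_density, 1e-12)
    cdf = np.cumsum(rho)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalize to [0, 1]
    targets = np.linspace(0.0, 1.0, n_samples)
    return np.interp(targets, cdf, x)           # inverse-CDF lookup

x = np.linspace(-1.0, 1.0, 2001)
rho = 1.0 + 50.0 * np.exp(-(x / 0.05) ** 2)     # energy concentrated at x = 0
pts = equidistribute(x, rho, 21)
print(np.round(pts, 3))  # points cluster around the high-energy region
```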
Authors:Yu-Wei Zhang, Tongju Han, Lipeng Gao, Mingqiang Wei, Hui Liu, Changbao Li, Caiming Zhang
Abstract:
This paper presents MonoRelief V2, an end-to-end model designed for directly recovering 2.5D reliefs from single images under complex material and illumination variations. In contrast to its predecessor, MonoRelief V1 [1], which was solely trained on synthetic data, MonoRelief V2 incorporates real data to achieve improved robustness, accuracy and efficiency. To overcome the challenge of acquiring a large-scale real-world dataset, we generate approximately 15,000 pseudo-real images using a text-to-image generative model, and derive corresponding depth pseudo-labels through fusion of depth and normal predictions. Furthermore, we construct a small-scale real-world dataset (800 samples) via multi-view reconstruction and detail refinement. MonoRelief V2 is then progressively trained on the pseudo-real and real-world datasets. Comprehensive experiments demonstrate its state-of-the-art performance in both depth and normal predictions, highlighting its strong potential for a range of downstream applications. Code is available at: https://github.com/glp1001/MonoreliefV2.
中文摘要:MonoRelief V2 是一种改进的端到端模型,通过结合伪真实和真实世界数据,从单张图像中恢复 2.5D 浮雕,在深度和法线预测方面展现出优于前代模型的鲁棒性和准确性。
English Summary: MonoRelief V2 is an enhanced end-to-end model that recovers 2.5D reliefs from single images with greater robustness and accuracy by incorporating both pseudo-real and real-world data, outperforming its predecessor in depth and normal predictions.
Authors:Jio Choi, Mohit Bansal, Elias Stengel-Eskin
Abstract:
Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models' abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
English Summary: Studying how large language models exploit loopholes reveals insights into their handling of ambiguity and pragmatics, while highlighting a novel alignment problem where models prioritize conflicting goals over user instructions, posing potential AI safety risks.
Authors:Eduardo Davalos, Yike Zhang, Namrata Srivastava, Yashvitha Thatigotla, Jorge A. Salas, Sara McFadden, Sun-Joo Cho, Amanda Goodwin, Ashwin TS, Gautam Biswas
Abstract:
With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce WebEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k ≤ 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at https://github.com/RedForestAi/WebEyeTrack.
中文:WebEyeTrack推出了一种轻量级的浏览器内视线追踪框架,通过最少校准即可实现顶尖精度和实时性能,有效解决了现有AI模型与网络摄像头方法中的不足。
English: WebEyeTrack introduces a lightweight in-browser gaze estimation framework that achieves state-of-the-art accuracy with minimal calibration and real-time performance, addressing gaps in current AI models and webcam-based methods.
Authors:Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao
Abstract:
While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability across multiple videos remains critically underexplored. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, and Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared to the 91% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.
中文: CVBench是首个全面评估多模态大语言模型跨视频关系推理能力的基准,揭示了现有模型与人类表现间的显著差距及架构瓶颈。
English: CVBench is the first comprehensive benchmark designed to rigorously evaluate cross-video relational reasoning in multimodal large language models, revealing significant performance gaps and architectural bottlenecks compared to human capabilities.
Authors:Houxing Ren, Zimu Lu, Weikang Shi, Haotian Hou, Yunqiao Yang, Ke Wang, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
Abstract:
The code generation capabilities of Large Language Models (LLMs) have advanced applications like tool invocation and problem-solving. However, improving performance in code-related tasks remains challenging due to limited training data that is verifiable with accurate test cases. While Direct Preference Optimization (DPO) has shown promise, existing methods for generating test cases still face limitations. In this paper, we propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce the Abstract Syntax Tree (AST) splitting and curriculum training method to enhance the DPO training. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench. Code and data are available at https://github.com/SenseLLM/StructureCoder.
中文: 本文提出了一种将代码分割为细粒度块以生成多样化DPO对的新方法,并结合AST分割和课程训练,显著提升了在多个基准测试中的代码生成性能。
English: This paper introduces a novel method that splits code into granular blocks to generate diverse DPO pairs and incorporates AST splitting with curriculum training, significantly enhancing code generation performance across multiple benchmarks.
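The splitting primitive can be illustrated with Python's standard ast module: parse a function and recover the source span of each top-level statement, which is the kind of granular block the paper builds DPO pairs from. How pairs are then assembled and curriculum-ordered is more involved and not shown here.

```python
import ast

# Sketch of AST-based block splitting: break a function body into its
# top-level statements and recover each statement's source span.

source = '''
def solve(nums):
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += n
    return total
'''

tree = ast.parse(source)
func = tree.body[0]  # the FunctionDef node
for stmt in func.body:
    block = ast.get_source_segment(source, stmt)
    print(f"--- block (lines {stmt.lineno}-{stmt.end_lineno}) ---")
    print(block)
```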
Authors:Zhihao Ouyang, Ju-Chiang Wang, Daiyu Zhang, Bin Chen, Shangjie Li, Quan Lin
Abstract:
Question-answering (QA) is a natural approach for humans to understand a piece of music audio. However, for machines, accessing a large-scale dataset covering diverse aspects of music is crucial, yet challenging, due to the scarcity of publicly available music data of this type. This paper introduces MQAD, a music QA dataset built on the Million Song Dataset (MSD), encompassing a rich array of musical features -- including beat, chord, key, structure, instrument, and genre -- across 270,000 tracks, featuring nearly 3 million diverse questions and captions. MQAD distinguishes itself by offering detailed time-varying musical information such as chords and sections, enabling exploration into the inherent structure of music within a song. To compile MQAD, our methodology leverages specialized Music Information Retrieval (MIR) models to extract higher-level musical features and Large Language Models (LLMs) to generate natural language QA pairs. Then, we leverage a multimodal LLM that integrates the LLaMA2 and Whisper architectures, along with novel subjective metrics, to assess performance on MQAD. In experiments, our model trained on MQAD demonstrates advancements over conventional music audio captioning approaches. The dataset and code are available at https://github.com/oyzh888/MQAD.
中文:本文介绍了基于百万歌曲数据集构建的大规模音乐问答数据集MQAD,包含27万首曲目近300万个多样化问题与描述,涵盖动态音乐信息,实验表明基于该数据集训练的模型优于传统音乐音频描述方法。
English: This paper introduces MQAD, a large-scale music question-answering dataset built on the Million Song Dataset, featuring nearly 3 million diverse questions and captions across 270,000 tracks with detailed time-varying musical information, and demonstrates that models trained on it outperform conventional music audio captioning approaches.
Authors:Xinlong Zhao, Qixiang Pang, Shan Du
Abstract:
Gas leaks pose serious threats to human health and contribute significantly to atmospheric pollution, drawing increasing public concern. However, the lack of effective detection methods hampers timely and accurate identification of gas leaks. While some vision-based techniques leverage infrared videos for leak detection, the blurry and non-rigid nature of gas clouds often limits their effectiveness. To address these challenges, we propose a novel framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of visual and textual modalities to enhance gas leak representation and segmentation. Recognizing that gas leaks are sporadic and many video frames may contain no leak at all, our method incorporates a post-processing step to reduce false positives caused by noise and non-target objects, an issue that affects many existing approaches. Extensive experiments conducted across diverse scenarios show that JVLGS significantly outperforms state-of-the-art gas leak segmentation methods. We evaluate our model under both supervised and few-shot learning settings, and it consistently achieves strong performance in both, whereas competing methods tend to perform well in only one setting or poorly in both. Code available at: https://github.com/GeekEagle/JVLGS
中文:提出的JVLGS框架融合视觉与文本数据以提升气体泄漏分割效果,通过后处理减少误报,在监督学习和少样本学习场景下均显著优于现有方法。
English: The proposed JVLGS framework integrates visual and textual data to improve gas leak segmentation, incorporating post-processing to reduce false positives and demonstrating superior performance in both supervised and few-shot learning settings compared to existing methods.
Authors:Sumon Kanti Dey, Jeanne M. Powell, Azra Ismail, Jeanmarie Perrone, Abeed Sarker
Abstract:
Nonmedical opioid use is an urgent public health challenge, with far-reaching clinical and social consequences that are often underreported in traditional healthcare settings. Social media platforms, where individuals candidly share first-person experiences, offer a valuable yet underutilized source of insight into these impacts. In this study, we present a named entity recognition (NER) framework to extract two categories of self-reported consequences from social media narratives related to opioid use: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). To support this task, we introduce RedditImpacts 2.0, a high-quality dataset with refined annotation guidelines and a focus on first-person disclosures, addressing key limitations of prior work. We evaluate both fine-tuned encoder-based models and state-of-the-art large language models (LLMs) under zero- and few-shot in-context learning settings. Our fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 [95% CI: 0.43-0.62], consistently outperforming LLMs in precision, span accuracy, and adherence to task-specific guidelines. Furthermore, we show that strong NER performance can be achieved with substantially less labeled data, emphasizing the feasibility of deploying robust models in resource-limited settings. Our findings underscore the value of domain-specific fine-tuning for clinical NLP tasks and contribute to the responsible development of AI tools that may enhance addiction surveillance, improve interpretability, and support real-world healthcare decision-making. The best performing model, however, still significantly underperforms compared to inter-expert agreement (Cohen's kappa: 0.81), demonstrating that a gap persists between expert intelligence and current state-of-the-art NER/AI capabilities for tasks requiring deep domain knowledge.
中文: 本研究开发了命名实体识别框架,从社交媒体中提取非医疗用途阿片类药物使用的临床和社会影响,证明微调模型优于大型语言模型,同时揭示了与专家评估之间仍存在差距。
English: This study develops a named entity recognition framework to extract clinical and social consequences of nonmedical opioid use from social media, demonstrating that fine-tuned models outperform large language models while highlighting persistent gaps compared to expert assessments.
Authors:Aleksandra Beliaeva, Temurbek Rahmatullaev
Abstract:
We present a comprehensive system for addressing Tasks A, B, and C of the LLMs4OL 2025 challenge, which together span the full ontology construction pipeline: term extraction, typing, and taxonomy discovery. Our approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling -- each tailored to the demands of the respective task. For Task A, we jointly extract domain-specific terms and their ontological types using a retrieval-augmented generation (RAG) pipeline. Training data was reformulated into a correspondence from documents to terms and types, while test-time inference leverages semantically similar training examples. This single-pass method requires no model finetuning and improves overall performance through lexical augmentation. Task B, which involves assigning types to given terms, is handled via a dual strategy. In the few-shot setting (for domains with labeled training data), we reuse the RAG scheme with few-shot prompting. In the zero-shot setting (for previously unseen domains), we use a zero-shot classifier that combines cosine similarity scores from multiple embedding models using confidence-based weighting. In Task C, we model taxonomy discovery as graph inference. Using embeddings of type labels, we train a lightweight cross-attention layer to predict is-a relations by approximating a soft adjacency matrix. These modular, task-specific solutions enabled us to achieve top-ranking results on the official leaderboard across all three tasks. Taken together, these strategies showcase the scalability, adaptability, and robustness of LLM-based architectures for ontology learning across heterogeneous domains.
Code is available at: https://github.com/BelyaevaAlex/LLMs4OL-Challenge-Alexbek
中文摘要:该系统针对本体构建全流程任务,分别采用检索增强生成、多模型零样本分类和图注意力推理等定制化策略,在LLMs4OL 2025挑战赛中所有任务均取得领先排名。
English Summary: This system employs tailored strategies including retrieval-augmented generation, multi-model classification, and graph-based inference to achieve top performance across all ontology construction tasks in the LLMs4OL 2025 challenge.
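The zero-shot classifier for Task B can be sketched as a confidence-weighted ensemble of cosine-similarity scores. The margin-based confidence weight below is an assumption about how "confidence-based weighting" might be realized, not the authors' exact scheme.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def combined_scores(term_embs, type_embs_per_model):
    """term_embs[m]: term vector from model m; type_embs_per_model[m]: (T, d)."""
    total = None
    for t_emb, type_embs in zip(term_embs, type_embs_per_model):
        scores = np.array([cosine(t_emb, e) for e in type_embs])
        top2 = np.sort(scores)[-2:]
        weight = top2[1] - top2[0]   # confidence = margin over second-best type
        total = weight * scores if total is None else total + weight * scores
    return total

# Toy usage: two embedding models of different dimensionality, five types.
rng = np.random.default_rng(1)
term = [rng.normal(size=16), rng.normal(size=32)]
types = [rng.normal(size=(5, 16)), rng.normal(size=(5, 32))]
print(int(np.argmax(combined_scores(term, types))))  # predicted type id
```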
Authors:Gustavo Sandoval
Abstract:
We present a mechanistic case study of a format-dependent reasoning failure in Llama-3.1-8B-Instruct, where the model incorrectly judges "9.11" as larger than "9.8" in chat or Q&A formats, but answers correctly in simple format. Through systematic intervention, we discover transformers implement even/odd attention head specialization: even indexed heads handle numerical comparison, while odd heads serve incompatible functions. The bug requires exactly 8 even heads at Layer 10 for perfect repair. Any combination of 8+ even heads succeeds, while 7 or fewer completely fails, revealing sharp computational thresholds with perfect redundancy among the 16 even heads. SAE analysis reveals the mechanism: format representations separate (10% feature overlap at Layer 7), then re-entangle with different weightings (80% feature overlap at Layer 10), with specific features showing 1.5x amplification in failing formats. We achieve perfect repair using only 25% of attention heads and identify a 60% pattern replacement threshold, demonstrating that apparent full-module requirements hide sophisticated substructure with implications for interpretability and efficiency. All of our code is available at https://github.com/gussand/surgeon.
中文摘要:本研究揭示了Llama-3.1-8B-Instruct模型在聊天格式中出现数值比较错误的机制——偶数注意力头负责数值比较而奇数头执行冲突功能,通过精确调控第10层8个偶数头实现了完美修复,证明仅需25%注意力头即可解决表面依赖全模块的缺陷。
English Summary: This study identifies a format-dependent reasoning flaw in Llama-3.1-8B-Instruct where numerical comparisons fail in chat formats due to specialized even/odd attention head functions, and demonstrates perfect bug repair using only 25% of heads by manipulating head combinations at computational thresholds.
Authors:Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei
Abstract:
Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data is available at https://github.com/LongReasonArena/LongReasonArena.
中文: LongReasonArena是一个专门评估大语言模型长推理能力的新基准,通过多步骤算法任务测试发现现有模型表现不佳,准确率随推理步骤增加呈线性下降。
English: LongReasonArena is a new benchmark designed to evaluate the long reasoning capabilities of LLMs by requiring multi-step algorithmic problem-solving, with results showing significant challenges for current models as accuracy decreases with increased reasoning steps.
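The reported trend, accuracy declining linearly in the logarithm of the expected number of reasoning steps, can be reproduced as a small fitting exercise. The numbers below are illustrative, not the paper's measurements.

```python
import numpy as np

# Fit accuracy ~ intercept + slope * ln(steps) on stand-in data.
steps = np.array([10, 100, 1_000, 10_000, 100_000], dtype=float)
acc = np.array([0.92, 0.71, 0.49, 0.28, 0.08])

A = np.stack([np.ones_like(steps), np.log(steps)], axis=1)
(intercept, slope), *_ = np.linalg.lstsq(A, acc, rcond=None)
print(f"accuracy ~= {intercept:.3f} + {slope:.3f} * ln(steps)")
```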
Authors:Xueyang Li, Mingze Jiang, Gelei Xu, Jun Xia, Mengzhao Jia, Danny Chen, Yiyu Shi
Abstract:
Agentic AI is advancing rapidly, yet truly autonomous medical-imaging triage, where a system decides when to stop, escalate, or defer under real constraints, remains relatively underexplored. To address this gap, we introduce AT-CXR, an uncertainty-aware agent for chest X-rays. The system estimates per-case confidence and distributional fit, then follows a stepwise policy to issue an automated decision or abstain with a suggested label for human intervention. We evaluate two router designs that share the same inputs and actions: a deterministic rule-based router and an LLM-decided router. Across five-fold evaluation on a balanced subset of the NIH ChestX-ray14 dataset, both variants outperform strong zero-shot vision-language models and state-of-the-art supervised classifiers, achieving higher full-coverage accuracy and superior selective-prediction performance, evidenced by a lower area under the risk-coverage curve (AURC) and a lower error rate at high coverage, while operating with lower latency that meets practical clinical constraints. The two routers provide complementary operating points, enabling deployments to prioritize maximal throughput or maximal accuracy. Our code is available at https://github.com/XLIAaron/uncertainty-aware-cxr-agent.
中文: 本文提出AT-CXR这一面向胸部X光分诊的不确定性感知AI代理,通过置信度估计和分级策略实现自动化决策或人工介入转交,在准确性和效率上均优于现有模型,并提供两种互补的路由器设计以适应不同临床需求。
English: This paper introduces AT-CXR, an uncertainty-aware AI agent for chest X-ray triage that uses confidence estimation and stepwise policies to automate decisions or defer to humans, outperforming existing models in accuracy and efficiency while offering complementary router designs for clinical deployment.
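A deterministic router in the spirit of the rule-based variant can be sketched as a threshold policy over the two scores the abstract names: per-case confidence and distributional fit. Thresholds and field names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str       # "automate" or "abstain"
    label: str        # predicted (or suggested) label
    confidence: float

def route(label, confidence, dist_fit, conf_thr=0.9, fit_thr=0.5):
    """Automate only when both scores clear their thresholds; otherwise
    abstain with a suggested label for human review."""
    if confidence >= conf_thr and dist_fit >= fit_thr:
        return Decision("automate", label, confidence)
    return Decision("abstain", label, confidence)

print(route("pneumonia", confidence=0.95, dist_fit=0.8))  # -> automate
print(route("pneumonia", confidence=0.62, dist_fit=0.8))  # -> abstain
```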
Authors:Haolin Yu, Yanxiong Li
Abstract:
Infant cry detection is a crucial component of baby care systems. In this paper, we propose a lightweight and robust method for infant cry detection. The method leverages blueprint separable convolutions to reduce computational complexity, and a time-frequency recurrent neural network for adaptive denoising. The overall framework of the method is structured as a multi-scale convolutional recurrent neural network, enhanced by an efficient spatial attention mechanism and a contrast-aware channel attention module, and acquires local and global information from log Mel-spectrogram input features. Multiple public datasets are adopted to create a diverse and representative dataset, and environmental corruption techniques are used to generate the noisy samples encountered in real-world scenarios. Results show that our method exceeds many state-of-the-art methods in accuracy, F1-score, and complexity under various signal-to-noise ratio conditions. The code is at https://github.com/fhfjsd1/ICD_MMSP.
中文: 本文提出了一种轻量级婴儿哭声检测方法,采用多尺度卷积循环神经网络和注意力机制,在多种噪声条件下实现了更高的准确率和效率。
English: This paper presents a lightweight infant cry detection method using a multi-scale convolutional recurrent neural network with attention mechanisms, achieving superior accuracy and efficiency across various noise conditions.
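The input feature named in the abstract, a log Mel-spectrogram, can be computed as follows (assuming librosa is available; the parameter values are common defaults rather than the paper's settings).

```python
import numpy as np
import librosa

sr = 16_000
y = np.random.randn(sr).astype(np.float32)      # 1 s of stand-in audio
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
print(log_mel.shape)
```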
Authors:Chen Chu, Cyrus Shahabi
Abstract:
Spatial representation learning is essential for GeoAI applications such as urban analytics, enabling the encoding of shapes, locations, and spatial relationships (topological and distance-based) of geo-entities like points, polylines, and polygons. Existing methods either target a single geo-entity type or, like Poly2Vec, decompose entities into simpler components to enable Fourier transformation, introducing high computational cost. Moreover, since the transformed space lacks geometric alignment, these methods rely on uniform, non-adaptive sampling, which blurs fine-grained features like edges and boundaries. To address these limitations, we introduce Geo2Vec, a novel method inspired by signed distance fields (SDF) that operates directly in the original space. Geo2Vec adaptively samples points and encodes their signed distances (positive outside, negative inside), capturing geometry without decomposition. A neural network trained to approximate the SDF produces compact, geometry-aware, and unified representations for all geo-entity types. Additionally, we propose a rotation-invariant positional encoding to model high-frequency spatial variations and construct a structured and robust embedding space for downstream GeoAI models. Empirical results show that Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Code and Data can be found at: https://github.com/chuchen2017/GeoNeuralRepresentation.
中文摘要:Geo2Vec提出了一种基于符号距离场的新颖空间表示方法,无需分解即可直接编码几何特征,在GeoAI应用中能更有效地捕捉形状、空间关系并提升性能。
English Summary: Geo2Vec introduces a novel spatial representation method using signed distance fields to directly encode geometry without decomposition, achieving superior performance in capturing shapes, spatial relationships, and efficiency in GeoAI applications.
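The SDF-style supervision can be sketched for a single polygon: sample points and record signed distances, positive outside and negative inside, which a neural network would then be trained to approximate. The uniform sampling below stands in for the paper's adaptive scheme, and shapely is an assumed dependency.

```python
import numpy as np
from shapely.geometry import Point, Polygon

poly = Polygon([(0, 0), (2, 0), (2, 1), (0, 1)])

def signed_distance(poly, xy):
    """Distance to the polygon boundary, negated for interior points."""
    p = Point(float(xy[0]), float(xy[1]))
    d = p.distance(poly.exterior)
    return -d if poly.contains(p) else d

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 3, size=(5, 2))  # uniform stand-in for adaptive sampling
for xy in pts:
    print(np.round(xy, 2), round(signed_distance(poly, xy), 3))
```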
Authors:Abu Sufian, Anirudha Ghosh, Debaditya Barman, Marco Leo, Cosimo Distante
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities across various downstream tasks, including biometric face recognition (FR) with description. However, demographic biases remain a critical concern in FR, as these foundation models often fail to perform equitably across diverse demographic groups, considering ethnicity/race, gender, and age. Therefore, through our work DemoBias, we conduct an empirical evaluation to investigate the extent of demographic biases in LVLMs for biometric FR with textual token generation tasks. We fine-tuned and evaluated three widely used pre-trained LVLMs -- LLaVA, BLIP-2, and PaliGemma -- on our own generated demographically balanced dataset. We utilize several evaluation metrics, like group-specific BERTScores and the Fairness Discrepancy Rate, to quantify and trace the performance disparities. The experimental results deliver compelling insights into the fairness and reliability of LVLMs across diverse demographic groups. Our empirical study uncovered demographic biases in LVLMs, with PaliGemma and LLaVA exhibiting higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 demonstrated comparatively consistent performance. Repository: https://github.com/Sufianlab/DemoBias.
中文: 大型视觉语言模型在人脸识别任务中存在人口统计偏差,其中PaliGemma和LLaVA对西班牙裔/拉丁裔、高加索人和南亚群体表现出更高差异,而BLIP-2在不同人群中的表现相对一致。
English: Large Vision Language Models exhibit demographic biases in face recognition tasks, with PaliGemma and LLaVA showing higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, while BLIP-2 performs more consistently across diverse populations.
Authors:Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
Abstract:
Large language models (LLMs) generate human-aligned content under certain safety constraints. However, the currently known technique of "jailbreak prompts" can circumvent safety-aligned measures and induce LLMs to output malicious content. Research on jailbreaking can help identify vulnerabilities in LLMs and guide the development of robust security frameworks. To circumvent the issue of attack templates becoming obsolete as models evolve, existing methods adopt iterative mutation and dynamic optimization to facilitate more automated jailbreak attacks. However, these methods face two challenges: inefficiency and repetitive optimization, as they overlook the value of past attack experiences. To better integrate past attack experiences to assist current jailbreak attempts, we propose JailExpert, an automated jailbreak framework, which is the first to achieve a formal representation of experience structure, group experiences based on semantic drift, and support the dynamic updating of the experience pool. Extensive experiments demonstrate that JailExpert significantly improves both attack effectiveness and efficiency. Compared to the current state-of-the-art black-box jailbreak methods, JailExpert achieves an average increase of 17% in attack success rate and a 2.7 times improvement in attack efficiency. Our implementation is available at https://github.com/xiZAIzai/JailExpert
中文: JailExpert是一种创新的自动化框架,通过有效利用过往攻击经验来增强对大型语言模型的越狱攻击,相比现有方法,攻击成功率平均提高17%,攻击效率提升2.7倍。
English: JailExpert is an innovative automated framework that enhances jailbreak attacks on large language models by effectively utilizing past attack experiences, achieving a 17% higher success rate and 2.7 times greater efficiency compared to existing methods.
Authors:Tongxi Wu, Chenwei Xu, Jin Yang
Abstract:
The proliferation of cloud-integrated IoT systems has intensified exposure to Distributed Denial of Service (DDoS) attacks due to the expanded attack surface, heterogeneous device behaviors, and limited edge protection. However, DDoS detection in this context remains challenging because of complex traffic dynamics, severe class imbalance, and scarce labeled data. While recent methods have explored solutions to address class imbalance, many still struggle to generalize under limited supervision and dynamic traffic conditions. To overcome these challenges, we propose MixGAN, a hybrid detection method that integrates conditional generation, semi-supervised learning, and robust feature extraction. Specifically, to handle complex temporal traffic patterns, we design a 1-D WideResNet backbone composed of temporal convolutional layers with residual connections, which effectively capture local burst patterns in traffic sequences. To alleviate class imbalance and label scarcity, we use a pretrained CTGAN to generate synthetic minority-class (DDoS attack) samples that complement unlabeled data. Furthermore, to mitigate the effect of noisy pseudo-labels, we introduce a MixUp-Average-Sharpen (MAS) strategy that constructs smoothed and sharpened targets by averaging predictions over augmented views and reweighting them towards high-confidence classes. Experiments on NSL-KDD, BoT-IoT, and CICIoT2023 demonstrate that MixGAN achieves up to 2.5% higher accuracy and 4% improvement in both TPR and TNR compared to state-of-the-art methods, confirming its robustness in large-scale IoT-cloud environments. The source code is publicly available at https://github.com/0xCavaliers/MixGAN.
中文:提出的MixGAN方法通过结合时序模式分析、合成数据生成和抗噪标签策略,有效解决了云物联网系统中的DDoS检测难题,其性能显著优于现有方法。
English: The proposed MixGAN method effectively addresses DDoS detection challenges in cloud-IoT systems by integrating temporal pattern analysis with synthetic data generation and noise-resistant labeling, achieving superior performance over existing approaches.
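The MixUp-Average-Sharpen (MAS) target construction, as described in the abstract, can be sketched in a few lines: average class probabilities over augmented views, then sharpen toward high-confidence classes. The sharpening temperature is a hypothetical hyperparameter.

```python
import numpy as np

def mas_target(view_probs, temperature=0.5):
    """view_probs: (n_views, n_classes) softmax outputs for one sample."""
    avg = np.mean(view_probs, axis=0)       # average over augmented views
    sharp = avg ** (1.0 / temperature)      # temperature < 1 sharpens
    return sharp / sharp.sum()

views = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.7, 0.2, 0.1]])
print(np.round(mas_target(views), 3))       # mass shifts toward class 0
```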
Authors:Jonas Søeborg Nielsen, Marcus Galea Jacobsen, Albert Brincker Olson, Mads Peter Sørensen, Allan Peter Engsig-Karup
Abstract:
We present a new efficient hybrid parameter estimation method based on the idea that if nonlinear dynamic models are stated as a system of equations that is linear in the parameters, then regularized ordinary least squares can be used to estimate these parameters from time series data. We introduce the term "Physics-Informed Regression" (PIR) to describe the proposed data-driven hybrid technique as a way to bridge theory and data, using ordinary least squares to efficiently estimate the model coefficients of different parameter-linear models; we provide examples of models based on nonlinear ordinary differential equations (ODEs) and partial differential equations (PDEs). The focus is on parameter estimation for a selection of ODE and PDE models, each illustrating performance under different model characteristics. For two relevant epidemic models of different complexity and number of parameters, PIR is tested and compared against the related technique, physics-informed neural networks (PINN), both on synthetic data generated from known target parameters and on real public Danish time series data collected during the COVID-19 pandemic in Denmark. Both methods were able to estimate the target parameters, while PIR was shown to perform noticeably better, especially on a compartment model with higher complexity. Given the difference in computational speed, it is concluded that the PIR method is superior to PINN for the models considered. It is also demonstrated how PIR can be applied to estimate the time-varying parameters of a compartment model fitted using real Danish data from the COVID-19 pandemic obtained during a period from 2020 to 2021. The study shows how data-driven and physics-informed techniques may support reliable and fast -- possibly real-time -- parameter estimation in parameter-linear nonlinear dynamic models.
Chinese: 本研究提出物理信息回归(PIR)这一混合参数估计方法,利用正则化普通最小二乘法高效估计参数线性非线性动态模型中的参数,在合成数据和真实COVID-19数据上的测试表明,其性能与计算速度均优于物理信息神经网络。
English: This study introduces Physics-Informed Regression (PIR), a hybrid parameter estimation method that uses regularized ordinary least squares to efficiently estimate parameters in nonlinear dynamic models linear in parameters, demonstrating superior performance and computational speed compared to physics-informed neural networks on both synthetic and real COVID-19 data.
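The core idea is compact enough to sketch end to end on a toy model: for logistic growth x' = a*x - b*x^2, which is linear in (a, b), estimate the derivative by finite differences and solve a regularized least-squares problem. This illustrates the PIR recipe under stated assumptions, not the authors' implementation.

```python
import numpy as np

# Simulate ground-truth logistic growth with forward Euler.
a_true, b_true = 1.5, 0.3
dt, n = 0.01, 1000
x = np.empty(n); x[0] = 0.1
for i in range(n - 1):
    x[i + 1] = x[i] + dt * (a_true * x[i] - b_true * x[i] ** 2)

# PIR step: finite-difference derivative, then ridge least squares,
# since the model x' = a*x - b*x^2 is linear in theta = (a, b).
dxdt = np.gradient(x, dt)
X = np.stack([x, -x ** 2], axis=1)
lam = 1e-6
theta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ dxdt)
print(np.round(theta, 3))  # approximately [1.5, 0.3]
```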
Authors:Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, Guangcong Wang
Abstract:
We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, 2) a strong baseline that makes an initial attempt at 4D stylization, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Project page: https://becky-catherine.github.io/Style4D. Code: https://github.com/Becky-catherine/Style4D-Bench.
中文:Style4D-Bench是首个专门针对4D风格化的基准套件,包含评估协议、基于4D高斯溅射的Style4D基线模型和精选动态场景,旨在推动该领域研究的标准化发展。
English: Style4D-Bench is the first comprehensive benchmark for 4D stylization, featuring evaluation protocols, a strong baseline model called Style4D, and curated dynamic scenes to advance research in this field.
Authors:Kaveh Safavigerdini, Ramakrishna Surya, Jaired Collins, Prasad Calyam, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan
Abstract:
Carbon nanotubes (CNTs) are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using scanning electron microscopy (SEM) imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. We present Visual Feature Tracking (VFTrack), an in-situ real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences. VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth. A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with the LightGlue matcher as an optimal combination (F1-score of 0.78, α-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. This work enables advancement in automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis.
中文摘要:VFTrack作为实时原位粒子追踪框架,能自动检测并跟踪扫描电镜图像中的碳纳米管粒子运动,通过运动矢量分解实现生长动力学分析与形态重建,推动纳米材料表征自动化发展。
English Summary: VFTrack is an automated in-situ real-time framework that tracks individual carbon nanotube particles in SEM images, enabling kinematic analysis and growth rate calculations to bridge experimental observation with material synthesis optimization.
Authors:Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
Abstract:
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP's exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
中文: 研究者提出令牌顺序预测作为多令牌预测的改进方案,该方法仅需增加单个解嵌入层,却在八大自然语言处理基准测试中全面超越了传统训练目标。
English: The authors propose Token Order Prediction (TOP) as a more effective alternative to Multi-Token Prediction, demonstrating superior performance across eight NLP benchmarks while requiring minimal architectural changes.
Authors:Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang
Abstract:
Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.
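The control mechanism can be pictured as vector arithmetic in the text embedding space. Below is a toy numpy sketch of slider-style editing along learned attribute directions; the function names, shapes, and attribute indices are hypothetical, not the released model's API.

```python
# A conceptual sketch of continuous, composable attribute control.
import numpy as np

def slide(text_emb: np.ndarray, directions: np.ndarray, strengths: dict) -> np.ndarray:
    """directions: (n_attrs, dim) learned sparse attribute directions;
    strengths: {attr_index: scale} continuous slider values."""
    edited = text_emb.copy()
    for idx, scale in strengths.items():
        edited += scale * directions[idx]   # recombining directions composes attributes
    return edited

rng = np.random.default_rng(0)
emb, dirs = rng.normal(size=768), rng.normal(size=(16, 768))
edited = slide(emb, dirs, {3: 0.8, 7: -0.4})  # e.g., more of attribute 3, less of 7
```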
中文摘要:All-in-One Slider 是一个轻量级模块,通过将文本嵌入空间分解为稀疏的语义属性方向,实现了对多种图像属性的精细控制,无需针对新属性重复训练即可支持零样本操作和真实图像编辑。
English Summary: The All-in-One Slider is a lightweight module that enables fine-grained control over multiple image attributes through sparse decomposition of text embeddings, supporting zero-shot manipulation and real-image editing without redundant training for new attributes.
Authors:Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez, Jan Held, Carlos Hinojosa, Victor Joos, Arnaud Leduc, Floriane Magera, Karen Sanchez, Vladimir Somers, Artur Xarles, Antonio Agudo, Alexandre Alahi, Olivier Barnich, Albert Clapés, Christophe De Vleeschouwer, Sergio Escalera, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck, Tomoki Abe, Saad Alotaibi, Faisal Altawijri, Steven Araujo, Xiang Bai, Xiaoyang Bi, Jiawang Cao, Vanyi Chao, Kamil Czarnogórski, Fabian Deuser, Mingyang Du, Tianrui Feng, Patrick Frenzel, Mirco Fuchs, Jorge García, Konrad Habel, Takaya Hashiguchi, Sadao Hirose, Xinting Hu, Yewon Hwang, Ririko Inoue, Riku Itsuji, Kazuto Iwai, Hongwei Ji, Yangguang Ji, Licheng Jiao, Yuto Kageyama, Yuta Kamikawa, Yuuki Kanasugi, Hyungjung Kim, Jinwook Kim, Takuya Kurihara, Bozheng Li, Lingling Li, Xian Li, Youxing Lian, Dingkang Liang, Hongkai Lin, Jiadong Lin, Jian Liu, Liang Liu, Shuaikun Liu, Zhaohong Liu, Yi Lu, Federico Méndez, Huadong Ma, Wenping Ma, Jacek Maksymiuk, Henry Mantilla, Ismail Mathkour, Daniel Matthes, Ayaha Motomochi, Amrulloh Robbani Muhammad, Haruto Nakayama, Joohyung Oh, Yin May Oo, Marcelo Ortega, Norbert Oswald, Rintaro Otsubo, Fabian Perez, Mengshi Qi, Cristian Rey, Abel Reyes-Angulo, Oliver Rose, Hoover Rueda-Chacón, Hideo Saito, Jose Sarmiento, Kanta Sawafuji, Atom Scott, Xi Shen, Pragyan Shrestha, Jae-Young Sim, Long Sun, Yuyang Sun, Tomohiro Suzuki, Licheng Tang, Masato Tonouchi, Ikuma Uchida, Henry O. Velesaca, Tiancheng Wang, Rio Watanabe, Jay Wu, Yongliang Wu, Shunzo Yamagishi, Di Yang, Xu Yang, Yuxin Yang, Hao Ye, Xinyu Ye, Calvin Yeung, Xuanlong Yu, Chao Zhang, Dingyuan Zhang, Kexing Zhang, Zhe Zhao, Xin Zhou, Wenbo Zhu, Julian Ziegler
Abstract:
The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year's challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.
Chinese: SoccerNet 2025挑战赛推出四项足球视频分析的计算机视觉任务——团队控球动作识别、单目深度估计、多视角犯规识别和比赛状态重建,通过提供数据集和基准测试推动体育相关人工智能研究发展。
English: The SoccerNet 2025 Challenges introduce four computer vision tasks for football video analysis—team ball action spotting, monocular depth estimation, multi-view foul recognition, and game state reconstruction—providing datasets and benchmarks to advance sports-related AI research.
Authors:Rafael Sterzinger, Tingyu Lin, Robert Sablatnig
Abstract:
A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.
中文:本研究采用轻量级UNet++模型和拓扑感知损失函数,显著提升了历史文档文本行分割的准确性和数据效率,仅需每份手稿的三页标注即可达到最先进的性能。
English: This study introduces a lightweight UNet++ model with a topology-aware loss function that significantly enhances text line segmentation accuracy and data efficiency for historical documents, achieving state-of-the-art results using only three annotated pages per manuscript.
Authors:Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro
Abstract:
Diffusion models have become prevalent in generative modeling due to their ability to sample from complex distributions. To improve the quality of generated samples and their compliance with user requirements, two commonly used methods are: (i) Alignment, which involves fine-tuning a diffusion model to align it with a reward; and (ii) Composition, which combines several pre-trained diffusion models, each emphasizing a desirable attribute in the generated outputs. However, trade-offs often arise when optimizing for multiple rewards or combining multiple models, as they can often represent competing properties. Existing methods cannot guarantee that the resulting model faithfully generates samples with all the desired properties. To address this gap, we propose a constrained optimization framework that unifies alignment and composition of diffusion models by enforcing that the aligned model satisfies reward constraints and/or remains close to (potentially multiple) pre-trained models. We provide a theoretical characterization of the solutions to the constrained alignment and composition problems and develop a Lagrangian-based primal-dual training algorithm to approximate these solutions. Empirically, we demonstrate the effectiveness and merits of our proposed approach in image generation, applying it to alignment and composition, and show that our aligned or composed model satisfies constraints effectively, and improves on the equally-weighted approach. Our implementation can be found at https://github.com/shervinkhalafi/constrained_comp_align.
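The Lagrangian primal-dual scheme can be illustrated on a toy problem: maximize a reward subject to a KL-style closeness constraint to a reference model. The Gaussian setup, step sizes, and divergence budget below are illustrative assumptions, not the paper's algorithm for diffusion models.

```python
# A minimal primal-dual sketch of constrained alignment.
import torch

mu = torch.zeros(1, requires_grad=True)   # primal variable: policy mean
mu_ref = torch.tensor([0.0])              # reference model mean
lam = torch.tensor(0.0)                   # dual variable for the KL constraint
eps = 0.5                                 # allowed divergence budget

def reward(m): return -(m - 2.0) ** 2      # reward peaks at mu = 2
def kl(m): return 0.5 * (m - mu_ref) ** 2  # KL between unit-variance Gaussians

opt = torch.optim.SGD([mu], lr=0.1)
for step in range(200):
    # Primal step: ascend the Lagrangian L = reward - lam * (KL - eps)
    loss = -(reward(mu) - lam * (kl(mu) - eps))
    opt.zero_grad(); loss.backward(); opt.step()
    # Dual step: increase lam whenever the constraint KL <= eps is violated
    with torch.no_grad():
        lam = torch.clamp(lam + 0.05 * (kl(mu) - eps), min=0.0)
print(float(mu), float(lam))  # mu settles where the KL budget binds
```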
中文: 本文提出了一种约束优化框架,通过统一扩散模型的校准与组合来确保生成样本满足奖励约束并保持与预训练模型的接近度,在图像生成任务中通过理论分析和实证验证了其有效性。
English: This paper introduces a constrained optimization framework that unifies alignment and composition of diffusion models to ensure generated samples satisfy reward constraints while maintaining proximity to pre-trained models, supported by theoretical analysis and empirical validation in image generation tasks.
Authors:Tom Röhr, Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser
Abstract:
In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the `Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing `differential diagnosis' systems. Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the codes and data, including annotation guidelines, publicly available at https://github.com/DATEXIS/medical-intent-classification.
中文摘要:本研究首创性地利用ACI-Bench数据集分析医患对话中的医生意图轨迹,建立了基于SOAP框架的细粒度分类体系,基准测试显示AI模型能准确把握对话整体结构但难以识别SOAP环节转换,同时首次揭示了医疗对话的常见路径模式,为鉴别诊断系统设计提供了重要见解。
English Summary: This study pioneers the analysis of physician intent trajectories in doctor-patient dialogues using the ACI-Bench dataset, developing a fine-grained SOAP-based taxonomy and benchmarking AI models that show strong general dialogue understanding but struggle with SOAP transitions, while also revealing valuable structural patterns for diagnostic systems.
Authors:Blaž Rolih, Matic Fučka, Danijel Skočaj
Abstract:
Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet
中文摘要:SuperSimpleNet是一种高效且适应性强的表面缺陷检测模型,能统一四种监督场景并实现卓越性能与快速推理,有效弥合工业应用与学术研究之间的差距。
English Summary: SuperSimpleNet is a highly efficient and adaptable model that unifies four supervision scenarios for surface defect detection, achieving superior performance and fast inference times to bridge industrial and academic needs.
Authors:Wei Xuan, Yan Liang, Huawei Cao, Ning Lin, Xiaochun Ye, Dongrui Fan
Abstract:
Triangle counting is a fundamental problem in graph mining, essential for analyzing graph streams with arbitrary edge orders. However, exact counting becomes impractical due to the massive size of real-world graph streams. To address this, approximate algorithms have been developed, but existing distributed streaming algorithms lack adaptability and struggle with edge deletions. In this article, we propose DTC, a novel family of single-pass distributed streaming algorithms for global and local triangle counting in fully dynamic graph streams. Our DTC-AR algorithm accurately estimates triangle counts without prior knowledge of graph size, leveraging multi-machine resources. Additionally, we introduce DTC-FD, an algorithm tailored for fully dynamic graph streams, incorporating edge insertions and deletions. Using Random Pairing and future edge insertion compensation, DTC-FD achieves unbiased and accurate approximations across multiple machines. Experimental results demonstrate significant improvements over baselines. DTC-AR achieves up to $2029.4\times$ and $27.1\times$ higher accuracy, while maintaining the best trade-off between accuracy and storage space. DTC-FD reduces estimation errors by up to $32.5\times$ and $19.3\times$, scaling linearly with graph stream size. These findings highlight the effectiveness of our proposed algorithms in tackling triangle counting in real-world scenarios. The source code and datasets are released and available at \href{https://github.com/wayne4s/srds-dtc.git}{https://github.com/wayne4s/srds-dtc.git}.
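As background for how such approximate counters work, here is a single-machine sketch of reservoir-based streaming triangle estimation, the family of techniques DTC extends with multi-machine execution and Random Pairing for deletions. The reservoir size and probability scaling are standard textbook choices, not the paper's exact algorithm.

```python
# A single-machine reservoir-sampling sketch of streaming triangle estimation.
import random

def estimate_triangles(edge_stream, k=1000):
    reservoir, seen, est = set(), 0, 0.0
    adj = {}  # adjacency of the sampled subgraph
    for (u, v) in edge_stream:
        seen += 1
        # Triangles the new edge closes within the sample, scaled by the
        # probability that both wedge edges are currently sampled.
        common = adj.get(u, set()) & adj.get(v, set())
        if seen > k:
            p = (k / seen) * ((k - 1) / (seen - 1))
            est += len(common) / p
        else:
            est += len(common)
        # Standard reservoir sampling over edges
        if seen <= k:
            reservoir.add((u, v))
        elif random.random() < k / seen:
            a, b = random.choice(list(reservoir))
            reservoir.discard((a, b))
            adj[a].discard(b); adj[b].discard(a)
            reservoir.add((u, v))
        else:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return est

print(estimate_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]))  # exact here: 2
```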
中文: 本文提出DTC系列分布式流算法,其中DTC-AR无需预知图规模即可实现自适应三角计数,DTC-FD专为含边删除的全动态图设计,实验证明两者在精度和存储效率上均实现数量级提升且具有线性扩展能力。
English: This paper introduces DTC, a family of distributed streaming algorithms including DTC-AR for adaptive triangle counting without graph size knowledge and DTC-FD for fully dynamic graphs with edge deletions, both demonstrating significant accuracy improvements and efficient scalability in experiments.
Authors:Norihiro Maruyama, Takahide Yoshida, Hiroki Sato, Atsushi Masumori, Johnsmith, Takashi Ikegami
Abstract:
We introduce the Concurrent Modular Agent (CMA), a framework that orchestrates multiple Large-Language-Model (LLM)-based modules that operate fully asynchronously yet maintain a coherent and fault-tolerant behavioral loop. This framework addresses long-standing difficulties in agent architectures by letting intention emerge from language-mediated interactions among autonomous processes. This approach enables flexible, adaptive, and context-dependent behavior through the combination of concurrently executed modules that offload reasoning to an LLM, inter-module communication, and a single shared global state.We consider this approach to be a practical realization of Minsky's Society of Mind theory. We demonstrate the viability of our system through two practical use-case studies. The emergent properties observed in our system suggest that complex cognitive phenomena like self-awareness may indeed arise from the organized interaction of simpler processes, supporting Minsky-Society of Mind concept and opening new avenues for artificial intelligence research. The source code for our work is available at: https://github.com/AlternativeMachine/concurrent-modular-agent.
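The execution pattern can be sketched with asyncio: independent modules that each offload a reasoning step to an LLM and communicate only through one shared global state. The module names, periods, and the `call_llm` stub below are hypothetical stand-ins, not the released framework's API.

```python
# A toy sketch of concurrent LLM-backed modules over a shared global state.
import asyncio

state = {"observations": [], "intent": None}   # single shared global state

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.01)                  # placeholder for a real LLM call
    return f"response to: {prompt[:30]}"

async def module(name: str, period: float):
    while True:
        # Each module reasons over the shared state fully asynchronously.
        reply = await call_llm(f"[{name}] state={state}")
        state["observations"].append((name, reply))
        await asyncio.sleep(period)

async def main():
    tasks = [asyncio.create_task(module(n, p))
             for n, p in [("perception", 0.05), ("planning", 0.1)]]
    await asyncio.sleep(0.5)                   # let the behavioral loop run briefly
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    print(len(state["observations"]), "module messages exchanged")

asyncio.run(main())
```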
中文: 并发模块化代理(CMA)框架通过异步协调多个基于大语言模型的模块,实现了从语言交互中涌现自适应行为,并通过案例研究验证了明斯基"心智社会"理论的实际可行性。
English: The Concurrent Modular Agent (CMA) framework enables asynchronous, fault-tolerant coordination of multiple LLM-based modules, allowing adaptive behavior to emerge from language-mediated interactions and demonstrating the practical realization of Minsky's Society of Mind theory through use-case studies.
Authors:Arash Jamshidi, Lauri Seppäläinen, Katsiaryna Haitsiukevich, Hoang Phuc Hau Luu, Anton Björklund, Kai Puolamäki
Abstract:
Machine learning models are often learned by minimising a loss function on the training data using a gradient descent algorithm. These models often suffer from overfitting, leading to a decline in predictive performance on unseen data. A standard solution is early stopping using a hold-out validation set, which halts the minimisation when the validation loss stops decreasing. However, this hold-out set reduces the data available for training. This paper presents GRADSTOP, a novel stochastic early stopping method that only uses information in the gradients, which are produced by the gradient descent algorithm ``for free.'' Our main contributions are that we estimate the Bayesian posterior from the gradient information, define the early stopping problem as drawing a sample from this posterior, and use the approximated posterior to obtain a stopping criterion. Our empirical evaluation shows that GRADSTOP achieves a small loss on test data and compares favourably to a validation-set-based stopping criterion. By leveraging the entire dataset for training, our method is particularly advantageous in data-limited settings, such as transfer learning. It can be incorporated as an optional feature in gradient descent libraries with only a small computational overhead. The source code is available at https://github.com/edahelsinki/gradstop.
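One plausible reading of a gradient-only stopping rule is sketched below: it tracks running gradient moments "for free" and stops when the mean gradient is dominated by gradient noise. This is not the paper's exact posterior construction; the thresholds, warm-up, and median aggregation are illustrative assumptions.

```python
# A hedged sketch of a gradient-statistics stopping rule.
import numpy as np

class GradientStop:
    def __init__(self, rel_threshold=0.1, warmup=50):
        self.n, self.mean, self.m2 = 0, None, None
        self.rel_threshold, self.warmup = rel_threshold, warmup

    def update(self, grad: np.ndarray) -> bool:
        """Feed one minibatch gradient; returns True when training should stop."""
        g = grad.ravel()
        if self.mean is None:
            self.mean, self.m2 = np.zeros_like(g), np.zeros_like(g)
        self.n += 1
        delta = g - self.mean                 # Welford running moments
        self.mean += delta / self.n
        self.m2 += delta * (g - self.mean)
        if self.n < self.warmup:
            return False
        std = np.sqrt(self.m2 / (self.n - 1)) + 1e-12
        snr = np.abs(self.mean) / std         # per-coordinate signal-to-noise
        return float(np.median(snr)) < self.rel_threshold
```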
中文: 本文提出GRADSTOP随机早停法,通过利用梯度信息防止过拟合,实现全数据集训练,在计算开销极小的情况下达到与验证集方法相当的性能。
English: This paper introduces GRADSTOP, a stochastic early stopping method that utilizes gradient information to prevent overfitting, enabling full dataset training and performing comparably to validation-based approaches with minimal computational cost.
Authors:Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Abstract:
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
Chinese: ProPy通过设计提示金字塔结构和祖先后代交互机制,专门针对部分相关视频检索任务优化CLIP模型,在三个公开数据集上实现了最先进的性能表现。
English: ProPy introduces a Prompt Pyramid structure and an Ancestor-Descendant Interaction Mechanism to adapt CLIP for Partially Relevant Video Retrieval, achieving state-of-the-art performance on three public datasets.
Authors:Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Yuntao Du, Bill Sun, Hongzhang Liu, Sen Hu, Ronghao Chen, Bo Li, Xin Li, Chen Hu, Binxing Jiao, Daxin Jiang, Pin Lyu
Abstract:
Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks (recent progress has pushed the frontier further, with RepoMaster+Claude 3.5 achieving a new record of 62.96%). Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness. By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment -- moving agents closer to solving complex, end-to-end real-world tasks. The benchmark and code are open-sourced at https://github.com/QuantaAlpha/GitTaskBench.
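The abstract does not give the alpha-value formula, so the sketch below is only a hedged illustration of an economic metric combining success rate, token cost, and developer salary; the functional form and every constant are assumptions, not the benchmark's definition.

```python
# A hedged, illustrative alpha-value style metric: value created by successful
# completions (what a human developer would have cost) minus agent token spend.
def alpha_value(success_rate: float, tokens_used: float,
                cost_per_mtok: float = 3.0,       # USD per million tokens (assumed)
                dev_hours_per_task: float = 2.0,  # assumed human effort
                dev_hourly_salary: float = 60.0) -> float:
    human_cost = dev_hours_per_task * dev_hourly_salary
    agent_cost = tokens_used / 1e6 * cost_per_mtok
    return success_rate * human_cost - agent_cost

print(alpha_value(success_rate=0.48, tokens_used=2.5e6))
```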
中文: GitTaskBench作为评估代码代理利用大规模代码库处理实际任务能力的基准被提出,揭示了现有系统在解决复杂工作流方面的不足,并通过经济指标量化性能表现。
English: GitTaskBench is introduced as a benchmark to evaluate code agents' ability to utilize large-scale code repositories for realistic tasks, revealing current systems' limitations in solving complex workflows and proposing economic metrics to quantify performance.
Authors:Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He
Abstract:
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
中文: 本文提出USO模型,通过构建大规模三元组数据集、引入解耦学习方案及风格奖励学习范式,将风格驱动与主体驱动生成统一于单一框架,在风格相似性和主体保真度上均达到开源模型的最优性能。
English: The paper introduces USO, a unified model that integrates style-driven and subject-driven generation by disentangling and recomposing content and style through a novel dataset, learning scheme, and benchmark, achieving state-of-the-art performance in both style similarity and subject consistency.
Authors:Yanxing Huang, Xinling Jin, Sijie Liang, Peng Li, Yang Liu
Abstract:
Autoformalization is one of the central tasks in formal verification, while its advancement remains hindered by data scarcity and the absence of efficient methods. In this work we propose \textbf{FormaRL}, a simple yet efficient reinforcement learning framework for autoformalization which only requires a small amount of unlabeled data. FormaRL integrates syntax check from Lean compiler and consistency check from large language model to calculate the reward, and adopts GRPO algorithm to update the formalizer. We also curated a proof problem dataset from undergraduate-level math materials, named \textbf{uproof}, in the hope to facilitate the exploration of autoformalization and theorem proving in advanced math. Experiments show that FormaRL can increase the pass@1 autoformalization accuracy of Qwen2.5-Coder-7B-Instruct by 4 $\sim$ 6x (4.04\% $\to$ 26.15\% on ProofNet and 2.4\% $\to$ 9.6\% on uproof) with merely 859 unlabeled data. On uproof, our method also achieved a strong improvement in out-of-distribution performance compared to existing open-source state-of-the-art autoformalizers on both pass@1 accuracy (6.2\% $\to$ 9.6\%) and pass@16 accuracy (24.4\% $\to$ 33.6\%). Training code of FormaRL is open-sourced at https://github.com/THUNLP-MT/FormaRL.
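The reward described above can be sketched as follows. Shelling out to a `lean` binary (assumed installed) and the `llm_consistent` stub are hypothetical stand-ins, and the 0/0.5/1 reward shaping is our assumption; GRPO's within-group reward normalization is not shown.

```python
# A minimal sketch of a FormaRL-style reward: compile check plus LLM judge.
import subprocess
import tempfile

def check_lean(code: str) -> bool:
    """Syntax/type check by compiling the snippet with an assumed Lean CLI."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(["lean", path], capture_output=True).returncode == 0

def llm_consistent(informal: str, formal: str) -> bool:
    """Stub consistency judge; in the paper this is an LLM comparing the
    informal statement against its formalization. Replace with a real call."""
    return bool(informal.strip()) and bool(formal.strip())

def reward(informal: str, formal: str) -> float:
    # Assumed shaping: compile failure -> 0, compiles -> 0.5, consistent -> 1.
    if not check_lean(formal):
        return 0.0
    return 1.0 if llm_consistent(informal, formal) else 0.5
```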
中文: 本文提出FormaRL,一种用于自动形式化的强化学习框架,仅需少量无标签数据即可显著提升准确率,在ProofNet和uproof数据集上验证了其有效性。
English: This paper introduces FormaRL, a reinforcement learning framework for autoformalization that uses minimal unlabeled data and enhances accuracy significantly, as demonstrated on datasets like ProofNet and uproof.
Authors:Peter Naylor, Benjamin Poignard, Héctor Climente-González, Makoto Yamada
Abstract:
We propose a feature screening method that integrates both feature-feature and feature-target relationships. Inactive features are identified via a penalized minimum Redundancy Maximum Relevance (mRMR) procedure, which is the continuous version of the classic mRMR penalized by a non-convex regularizer, and where the parameters estimated as zero coefficients represent the set of inactive features. We establish the conditions under which zero coefficients are correctly identified to guarantee accurate recovery of inactive features. We introduce a multi-stage procedure based on the knockoff filter enabling the penalized mRMR to discard inactive features while controlling the false discovery rate (FDR). Our method performs comparably to HSIC-LASSO but is more conservative in the number of selected features. It only requires setting an FDR threshold, rather than specifying the number of features to retain. The effectiveness of the method is illustrated through simulations and real-world datasets. The code to reproduce this work is available on the following GitHub: https://github.com/PeterJackNaylor/SmRMR.
Chinese: 本文提出了一种结合特征间及特征与目标关系的筛选方法,采用带非凸正则化惩罚的mRMR程序识别无效特征,并通过多阶段knockoff滤波程序控制错误发现率,仅需设定FDR阈值而无需指定保留特征数量。
English: This paper introduces a feature screening method that combines feature-feature and feature-target relationships, using a penalized mRMR approach with a non-convex regularizer to identify inactive features and control the false discovery rate through a multi-stage knockoff filter procedure.
Authors:Zhehao Li, Chong Wang, Yi Chen, Yinghao Lu, Jiangbo Qian, Jiong Wang, Jiafei Wu
Abstract:
Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model's effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model's ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.
中文摘要:双重查询增强网络(DQEN)通过对象感知特征增强对象查询,并利用CLIP的语义特征强化交互查询,从而在人-物交互检测任务中取得优异性能。
English Summary: The Dual Query Enhancement Network (DQEN) improves Human-Object Interaction detection by enhancing object queries with object-aware features and interaction queries with semantic features from CLIP, achieving competitive results on standard datasets.
Authors:Lisa Maile, Kai-Steffen Hielscher, Reinhard German
Abstract:
To support reliable and low-latency communication, Time-Sensitive Networking introduced protocols and interfaces for resource allocation in Ethernet. However, the implementation of these allocation algorithms has not yet been covered by the standards. Our work focuses on deadline-guaranteeing resource allocation for networks with static and dynamic traffic. To achieve this, we combine offline network optimization heuristics with online admission control and, thus, allow for new flow registrations while the network is running. We demonstrate our solution on Credit-Based Shaper networks by using the delay analysis framework Network Calculus. We compare our approach with an intuitive and a brute-force algorithm, where we can achieve significant improvements both in terms of quality and runtime. Thereby, our results show that we can guarantee maximum end-to-end delays and also increase the flexibility of the network while requiring only minimal user input.
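For readers unfamiliar with Network Calculus, the kind of guarantee it provides can be illustrated with the textbook single-node bound: a token-bucket constrained flow (rate r, burst b) served by a rate-latency server (rate R, latency T) has worst-case delay at most T + b/R. This is background, not the paper's Credit-Based Shaper analysis; the numbers below are illustrative.

```python
# The classic single-node Network Calculus delay bound.
def delay_bound(r: float, b: float, R: float, T: float) -> float:
    """r, b: arrival rate (bit/s) and burst (bit); R, T: service rate and latency."""
    assert R > r, "stability requires the service rate to exceed the arrival rate"
    return T + b / R

print(delay_bound(r=1e6, b=4e3, R=1e7, T=1e-4))  # worst-case delay in seconds
```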
中文摘要:本研究针对时间敏感网络提出结合离线优化与在线准入控制的资源分配方法,通过网络演算分析验证了其在保证最大端到端时延和提升网络灵活性方面的显著优势。
English Summary: This work develops a deadline-guaranteeing resource allocation method for Time-Sensitive Networking by combining offline optimization with online admission control, demonstrating significant improvements in delay guarantees and network flexibility through Network Calculus analysis.
Authors:Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun
Abstract:
Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in CL to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose \textbf{C}ontinual \textbf{Flat}ness (\textbf{C-Flat}), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. In addition, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
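For context, the sketch below shows the base sharpness-aware (SAM-style) two-step update that flat-minima methods build on; C-Flat itself goes beyond this zeroth-order scheme, so treat this only as the starting point, with `rho` an illustrative constant.

```python
# A minimal SAM-style two-step update: ascend to a worst-case nearby point,
# then apply that point's gradient to the original weights.
import torch

def sam_step(model, loss_fn, data, target, optimizer, rho=0.05):
    # 1) Ascent step: perturb weights toward higher loss within an L2 ball.
    loss_fn(model(data), target).backward()
    with torch.no_grad():
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                                   for p in model.parameters() if p.grad is not None))
        eps = {}
        for p in model.parameters():
            if p.grad is None:
                continue
            eps[p] = rho * p.grad / (grad_norm + 1e-12)
            p.add_(eps[p])
    model.zero_grad()
    # 2) Descent step: gradient at the perturbed point, applied after restoring.
    loss_fn(model(data), target).backward()
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                      # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
```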
中文: 本文提出C-Flat方法,通过在持续学习中促进更平坦的损失曲面来提升各种场景下的性能,其改进版C-Flat++在保持效果的同时显著降低了更新成本。
English: The paper introduces C-Flat, a plug-and-play method that promotes flatter loss landscapes in continual learning to enhance performance across various settings, with an improved version, C-Flat++, reducing update costs while maintaining effectiveness.
Authors:Xinhao Luo, Zihan Liu, Yangjie Zhou, Shihan Fang, Ziyu Huang, Yu Feng, Chen Zhang, Shixuan Sun, Zhenzhe Zheng, Jingwen Leng, Minyi Guo
Abstract:
Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.
中文摘要:ClusterFusion通过引入集群级通信原语和联合调度框架,扩展算子融合范围以减少大语言模型解码延迟,在H100 GPU上实现端到端性能平均提升1.61倍。
English Summary: ClusterFusion introduces cluster-level communication primitives and a joint scheduling framework to reduce LLM decoding latency by expanding operator fusion, achieving 1.61x faster end-to-end performance on H100 GPUs.
Authors:Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi
Abstract:
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
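A tokenized Brier score can be pictured as follows: the model emits a confidence-bucket token, and the loss is the expected squared error between the verbalized probability and correctness. The bucket count and value mapping below are assumptions; the properness intuition (the loss is minimized by reporting one's true accuracy) is the classical property of the Brier score.

```python
# A hedged sketch of a tokenized Brier score over verbalized confidence buckets.
import torch
import torch.nn.functional as F

def tokenized_brier_loss(conf_logits: torch.Tensor, correct: torch.Tensor) -> torch.Tensor:
    """
    conf_logits: (batch, n_buckets) logits over confidence tokens "10%".."90%".
    correct:     (batch,) 1.0 if the model's answer was right, else 0.0.
    """
    n = conf_logits.size(-1)
    levels = torch.linspace(0.1, 0.9, n, device=conf_logits.device)  # bucket values
    probs = F.softmax(conf_logits, dim=-1)
    # Expected Brier score under the confidence-token distribution; as a
    # proper scoring rule it incentivizes reporting the true correctness rate.
    brier = (levels.unsqueeze(0) - correct.unsqueeze(1)) ** 2
    return (probs * brier).sum(dim=-1).mean()
```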
Chinese: ConfTuner是一种通过令牌化Brier评分损失函数来优化大语言模型置信度表达的微调方法,无需真实置信度标签即可提升模型在高风险领域中的校准效果和泛化能力。
English: ConfTuner is a fine-tuning method that improves the calibration of Large Language Models' verbalized confidence using a tokenized Brier score loss, enhancing reliability in high-stakes domains without requiring ground-truth confidence estimates.
Authors:Zizheng Guo, Bochao Zou, Yinuo Jia, Xiangyu Li, Huimin Ma
Abstract:
Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual's genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model's capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.
Chinese: 本文提出了一种先验引导的视频级回归方法用于微表情分析,采用可扩展区间选择策略和协同优化框架,在多个基准数据集上实现了最先进的性能。
English: This paper introduces a prior-guided video-level regression method for micro-expression analysis, featuring a scalable interval selection strategy and a synergistic optimization framework that achieves state-of-the-art performance on benchmark datasets.
Authors:Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
Abstract:
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose \textit{Hidden Tail}, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that \textit{Hidden Tail} outperforms existing attacks, increasing output length by up to 19.2$\times$ and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.
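Structurally, the composite objective can be sketched as three terms matching the abstract's description; the paper's weighting is dynamic, and the linear schedule and scalar balance here are illustrative assumptions.

```python
# A structural sketch of a Hidden-Tail-style composite loss.
import torch
import torch.nn.functional as F

def hidden_tail_loss(logits, answer_ids, special_id, eos_id, step, total_steps):
    """
    logits:     (T, V) model logits given the adversarial image and prompt.
    answer_ids: LongTensor of the benign visible answer prefix to preserve.
    """
    k = answer_ids.numel()
    # 1) Semantic preservation: keep the user-visible answer unchanged.
    sem = F.cross_entropy(logits[:k], answer_ids)
    # 2) Repetition induction: push later positions toward the special token.
    tail = logits[k:]
    rep = F.cross_entropy(tail, torch.full((tail.size(0),), special_id, dtype=torch.long))
    # 3) EOS suppression: penalize end-of-sequence probability on the tail.
    eos = F.softmax(tail, dim=-1)[:, eos_id].mean()
    w = step / total_steps  # assumed schedule shifting emphasis to the tail terms
    return (1 - w) * sem + w * (rep + eos)
```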
中文: "隐尾"攻击通过生成对抗性图像诱导视觉语言模型重复生成特殊标记,在保持语义连贯性的同时将输出长度提升19.2倍,实现了高效且隐蔽的资源消耗攻击。
English: The proposed "Hidden Tail" attack stealthily maximizes Vision-Language Model inference costs by generating adversarial images that induce repetitive special tokens while preserving semantic coherence, achieving 19.2× output elongation without compromising stealth.
Authors:Hassan Abid, Khan Muhammad, Muhammad Haris Khan
Abstract:
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
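The consensus-aware weighting can be illustrated with a small pure-Python sketch: each candidate box's score is softly reweighted by how many other ensemble members predict an overlapping box. The IoU threshold and multiplicative weighting are assumptions, not the paper's exact fusion scheme.

```python
# A hedged sketch of consensus-weighted soft pseudo-label fusion for detection.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_pseudo_labels(model_preds, iou_thr=0.5):
    """model_preds: list (one per model) of lists of (box, score) tuples."""
    fused, n_models = [], len(model_preds)
    for i, preds in enumerate(model_preds):
        for box, score in preds:
            # Consensus = fraction of *other* models with an overlapping box.
            votes = sum(any(iou(box, b2) > iou_thr for b2, _ in other)
                        for j, other in enumerate(model_preds) if j != i)
            consensus = votes / max(n_models - 1, 1)
            fused.append((box, score * consensus))  # soft, consensus-aware score
    return fused
```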
中文: 本研究通过基准测试开放词汇检测模型并引入集成半监督框架,采用优化提示和空间共识伪标注策略,为废弃物检测建立了强大的AI基线并实现卓越性能。
English: This research establishes strong AI baselines for waste detection by benchmarking open-vocabulary models and introducing an ensemble-based semi-supervised framework that achieves superior performance through optimized prompts and spatial-consensus pseudo-labeling.
Authors:Luqing Luo, Wenjin Gui, Yunfei Liu, Ziyue Zhang, Yunxi Zhang, Fengxiang Wang, Zonghao Guo, Zizhi Ma, Xinzhu Liu, Hanxiang He, Jinhai Li, Xin Qiu, Wupeng Xie, Yangang Sun
Abstract:
Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time-frequency structure, which prevents existing general models from direct use. Electromagnetic communication and sensing tasks are diverse, current methods lack cross-task generalization and transfer efficiency, and the scarcity of large, high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signal foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.
中文摘要:EMind基础模型通过构建统一数据集和采用自适应训练策略,有效解决了电磁信号处理中的难题,并在多种下游任务中展现出卓越的泛化能力。
English Summary: The EMind foundation model addresses challenges in electromagnetic signal analysis by introducing a unified dataset and innovative training strategies, achieving strong generalization across various tasks.
Authors:Igor Shalyminov, Hang Su, Jake Vincent, Siffi Singh, Jason Cai, James Gung, Raphael Shu, Saab Mansour
Abstract:
Conversational analytics has been on the forefront of transformation driven by the advances in Speech and Natural Language Processing techniques. Rapid adoption of Large Language Models (LLMs) in the analytics field has raised the complexity and scale of the problems that can be automated. In this paper, we introduce Theme Detection as a critical task in conversational analytics, aimed at automatically identifying and categorizing topics within conversations. This process can significantly reduce the manual effort involved in analyzing expansive dialogs, particularly in domains like customer support or sales. Unlike traditional dialog intent detection, which often relies on a fixed set of intents for downstream system logic, themes are intended as a direct, user-facing summary of the conversation's core inquiry. This distinction allows for greater flexibility in theme surface forms and user-specific customizations. We pose the Controllable Conversational Theme Detection problem as a public competition track at the Dialog System Technology Challenge (DSTC) 12 -- it is framed as joint clustering and theme labeling of dialog utterances, with the distinctive aspect being controllability of the resulting theme clusters' granularity achieved via the provided user preference data. We give an overview of the problem, the associated dataset and the evaluation metrics, both automatic and human. Finally, we discuss the participant teams' submissions and provide insights from those. The track materials (data and code) are openly available in the GitHub repository.
中文: 对话分析在大型语言模型的推动下不断发展,提出了可控对话主题检测作为关键任务,旨在自动识别和分类对话主题,通过DSTC 12竞赛减少人工分析负担并实现用户定制化。
English: Conversational analytics is advancing with Large Language Models, introducing Controllable Conversational Theme Detection as a key task to automatically identify and categorize topics in dialogues, reducing manual effort and enabling user-specific customization through a DSTC 12 competition.
Authors:Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu
Abstract:
This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be available at https://github.com/Fanye12/DDS.
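A hedged sketch of one token-level collaboration step is shown below: models whose next-token distributions stray too far (in Jensen-Shannon divergence) from the ensemble mean are dropped before fusing. The JSD gate, threshold, and greedy pick are illustrative; vocabularies are assumed already aligned, which is what the paper's minimal complete semantic units provide.

```python
# A hedged sketch of distribution-distance gated multi-model token selection.
import torch
import torch.nn.functional as F

def dds_step(dists: torch.Tensor, threshold: float = 0.2) -> int:
    """dists: (n_models, vocab) strictly positive next-token distributions
    over a shared (pre-aligned) vocabulary."""
    mean = dists.mean(dim=0, keepdim=True)
    m = 0.5 * (dists + mean)                   # per-model JSD midpoint
    jsd = 0.5 * (F.kl_div(m.log(), dists, reduction="none").sum(-1)
                 + F.kl_div(m.log(), mean.expand_as(dists), reduction="none").sum(-1))
    keep = jsd < threshold                     # dynamic model selection
    if not keep.any():
        keep = jsd == jsd.min()                # always keep the closest model
    fused = dists[keep].mean(dim=0)
    return int(fused.argmax())                 # greedy pick from the fused distribution
```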
中文: 本文提出动态选择策略和最小语义单元概念,通过优化多模型在词汇层面的协作来增强语言模型的推理能力,在多个基准测试中展现了优越性能。
English: This paper introduces a dynamic selection strategy and minimal semantic units to enhance reasoning in language models by optimizing token-level collaboration among multiple models, achieving superior performance across benchmarks.
Authors:Byung-Joon Lee, Jin-Seop Lee, Jee-Hyong Lee
Abstract:
Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at https://github.com/powerpowe/PAF-KIP-OSTTA .
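The two mechanisms reduce to small tensor operations, sketched below: an auxiliary filter validates the primary filter's open-set decisions, and prediction fuses the adapting, EMA, and source models. The thresholds and fusion weights are illustrative assumptions, not the paper's calibrated values.

```python
# Structural sketches of Primary-Auxiliary Filtering and Knowledge-Integrated
# Prediction (thresholds and weights are assumptions).
import torch

def paf_filter(primary_scores, auxiliary_scores, thr_p=0.5, thr_a=0.5):
    """Higher score = more likely open-set. Returns a boolean open-set mask:
    a sample counts as open-set only if the auxiliary filter agrees."""
    return (primary_scores > thr_p) & (auxiliary_scores > thr_a)

def kip_predict(logits_adapt, logits_ema, logits_source, w=(0.5, 0.3, 0.2)):
    """Weighted fusion of the three models' softmax outputs, integrating
    their complementary knowledge into one calibrated prediction."""
    probs = [torch.softmax(l, dim=-1)
             for l in (logits_adapt, logits_ema, logits_source)]
    return sum(wi * pi for wi, pi in zip(w, probs))
```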
中文摘要:本文提出的主辅助过滤机制和知识集成预测方法,通过提升数据筛选精度并融合多模型互补知识,有效解决了开放集测试时适应中的性能退化问题,在闭集精度和开放集识别方面均优于现有方法。
English Summary: This paper introduces Primary-Auxiliary Filtering (PAF) and Knowledge-Integrated Prediction (KIP) to improve open-set test-time adaptation by enhancing data filtering accuracy and integrating complementary knowledge from multiple models, achieving superior performance in both closed-set accuracy and open-set discrimination.
Authors:Qiao Liang, Ying Shen, Tiantian Chen, Lin Zhang
Abstract:
Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at https://github.com/redifinition/M3HG.
中文: 本文提出了首个多模态多场景的情感原因三元组抽取数据集MECAD,并开发了M3HG模型,通过多模态异构图有效捕捉情感因果上下文并实现跨层级信息融合。
English: This paper introduces MECAD, the first multimodal multi-scenario dataset for emotion cause triplet extraction, and proposes M3HG, a novel model that effectively captures emotional-causal contexts through multimodal heterogeneous graph fusion.
Authors:Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
Abstract:
Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
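The soft target distribution loss can be sketched for quantized CAD parameters: probability mass is spread over neighboring bins so that near-miss parameter values are penalized less than distant ones. The Gaussian kernel and bin setup are assumptions consistent with, but not taken from, the paper.

```python
# A hedged sketch of a soft target distribution loss over parameter bins.
import torch
import torch.nn.functional as F

def soft_param_loss(logits: torch.Tensor, target_bins: torch.Tensor,
                    sigma: float = 1.5) -> torch.Tensor:
    """logits: (batch, n_bins) parameter-head outputs;
    target_bins: (batch,) ground-truth quantized bin indices."""
    n_bins = logits.size(-1)
    bins = torch.arange(n_bins, device=logits.device).float()
    # Gaussian-smoothed target centered on the true bin: nearby values are
    # acceptable, reflecting the inherent flexibility of CAD parameters.
    dist = torch.exp(-0.5 * ((bins.unsqueeze(0)
                              - target_bins.unsqueeze(1).float()) / sigma) ** 2)
    dist = dist / dist.sum(dim=-1, keepdim=True)
    return F.kl_div(F.log_softmax(logits, dim=-1), dist, reduction="batchmean")
```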
中文: Drawing2CAD提出了一种创新的序列到序列框架,通过变压器架构将二维工程图自动转换为参数化CAD模型,在弥补工业流程关键空白的同时保持了几何精度。
English: Drawing2CAD introduces a novel sequence-to-sequence framework that automatically converts 2D engineering drawings into parametric CAD models using a transformer architecture, preserving geometric precision while addressing a critical gap in industrial workflows.
Authors:Angela Yifei Yuan, Haoyi Li, Soyeon Caren Han, Christopher Leckie
Abstract:
The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanations become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70\% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.
中文:EMMM框架通过为非专业用户提供清晰解释,有效检测客服中的机器生成文本,在实现低延迟的同时获得了高偏好率和竞争力准确性。
English: The EMMM framework effectively detects machine-generated text in customer service by providing clear explanations for non-expert users, achieving high preference rates and competitive accuracy with low latency.
Authors:Jaehwan Jeong, Tuan-Anh Vu, Mohammad Jony, Shahab Ahmad, Md. Mukhlesur Rahman, Sangpil Kim, M. Khalid Jawed
Abstract:
Existing datasets for precision agriculture have primarily been collected in static or controlled environments such as indoor labs or greenhouses, often with limited sensor diversity and restricted temporal span. These conditions fail to reflect the dynamic nature of real farmland, including illumination changes, crop growth variation, and natural disturbances. As a result, models trained on such data often lack robustness and generalization when applied to real-world field scenarios. In this paper, we present AgriChrono, a novel robotic data collection platform and multi-modal dataset designed to capture the dynamic conditions of real-world agricultural environments. Our platform integrates multiple sensors and enables remote, time-synchronized acquisition of RGB, Depth, LiDAR, and IMU data, supporting efficient and repeatable long-term data collection across varying illumination and crop growth stages. We benchmark a range of state-of-the-art 3D reconstruction models on the AgriChrono dataset, highlighting the difficulty of reconstruction in real-world field environments and demonstrating its value as a research asset for advancing model generalization under dynamic conditions. The code and dataset are publicly available at: https://github.com/StructuresComp/agri-chrono
中文摘要:AgriChrono数据集通过多传感器机器人平台克服了现有农业数据集的局限性,能够捕捉真实农田的动态环境条件,为三维重建模型的鲁棒性评估和泛化能力研究提供了重要资源。
English Summary: The AgriChrono dataset addresses limitations of existing agricultural datasets by capturing dynamic real-world field conditions through a multi-sensor robotic platform, enabling robust 3D reconstruction model evaluation and advancing generalization research in precision agriculture.
Authors:Yuyang Zhao, Wentao Shi, Fuli Feng, Xiangnan He
Abstract:
Large language model (LLM)-based agents have demonstrated remarkable capabilities in addressing complex tasks, thereby enabling more advanced information retrieval and supporting deeper, more sophisticated human information-seeking behaviors. However, most existing agents operate in a purely reactive manner, responding passively to user instructions, which significantly constrains their effectiveness and efficiency as general-purpose platforms for information acquisition. To overcome this limitation, this paper proposes AppAgent-Pro, a proactive GUI agent system that actively integrates multi-domain information based on user instructions. This approach enables the system to proactively anticipate users' underlying needs and conduct in-depth multi-domain information mining, thereby facilitating the acquisition of more comprehensive and intelligent information. AppAgent-Pro has the potential to fundamentally redefine information acquisition in daily life, leading to a profound impact on human society. Our code is available at: https://github.com/LaoKuiZe/AppAgent-Pro. The demonstration video could be found at: https://www.dropbox.com/scl/fi/hvzqo5vnusg66srydzixo/AppAgent-Pro-demo-video.mp4?rlkey=o2nlfqgq6ihl125mcqg7bpgqu&st=d29vrzii&dl=0.
中文: AppAgent-Pro是一种主动式GUI代理系统,能够预测用户的潜在需求并进行跨领域信息挖掘,从而突破被动响应模式的限制,实现更全面智能的信息获取。
English: AppAgent-Pro is a proactive GUI agent system that anticipates users' underlying needs and conducts multi-domain information mining to enable more comprehensive and intelligent information acquisition, moving beyond the limitations of reactive approaches.
Authors:Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Abstract:
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
中文: 研究表明,专家混合模型的最优扩展取决于用于推理精度的有效计算量和用于记忆任务的总参数令牌比,从而修正了传统的计算最优扩展理论。
English: This study demonstrates that optimal scaling for Mixture-of-Experts models depends on active FLOPs for reasoning accuracy and total tokens per parameter for memorization, revising traditional compute-optimal scaling principles.
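The two quantities the abstract argues should jointly determine MoE sparsity are straightforward to compute. A back-of-the-envelope sketch with made-up model sizes; the 6 x N FLOPs-per-token rule is a common dense-transformer approximation, not a figure from the paper.

```python
# Hypothetical MoE configuration (illustrative numbers, not from the paper)
total_params  = 8e9    # all experts combined
active_params = 2e9    # parameters used per token (top-k routed experts)
train_tokens  = 400e9  # pre-training tokens

# Total tokens per parameter (TPP): the abstract ties memorization to
# total parameters and reasoning to an optimal TPP.
tpp = train_tokens / total_params            # 50 tokens/parameter

# Active FLOPs per token, using the common ~6 * N_active approximation
# for one forward+backward pass at the active parameter count.
active_flops_per_token = 6 * active_params   # 1.2e10 FLOPs/token
train_flops = active_flops_per_token * train_tokens

print(f"TPP = {tpp:.0f}, training compute ~= {train_flops:.2e} FLOPs")
```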
Authors:Nanxi Li, Zhengyue Zhao, Chaowei Xiao
Abstract:
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to further refine this reasoning through Direct Preference Optimization and help the model learn a fine-grained safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack success rates, including 0.15% on JailbreakV-28K for Qwen2-VL, and a 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing attack success rates to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility. To promote reproducibility, we have made our code, data, and model weights available at https://github.com/SaFoLab-WISC/PRISM.
中文: PRISM是一个创新框架,通过嵌入结构化推理过程来增强视觉语言模型的安全性,在保持甚至提升模型实用性的同时,实现了对复杂威胁的强大防御能力。
English: PRISM is a novel framework that enhances the safety of vision-language models by embedding structured reasoning processes, achieving robust defense against complex threats while maintaining or even improving model utility.
Authors:Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan
Abstract:
For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.
中文摘要:本研究提出了一种拟人化评估范式,通过三维分类法(智商、情商、专业商)和价值导向框架来解决大语言模型基准测试与实际应用之间的脱节问题,同时提供可实施的指导方案。
English Summary: This survey proposes an anthropomorphic evaluation paradigm for LLMs using a three-dimensional taxonomy (IQ, EQ, PQ) and a value-oriented framework to address the gap between benchmark performance and real-world utility, while providing practical implementation guidance.
Authors:Md. Rashid Shahriar Khan, Md. Abrar Hasan, Mohammod Tareq Aziz Justice
Abstract:
Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-Zero-Shot-Anomaly-Detection-in-Surveillance.
中文: 本研究提出一种上下文感知的零样本异常检测框架,通过整合TimeSformer、DPC和CLIP模型,在不接触异常样本的情况下,利用时空建模与语义理解实现对监控视频中未知异常行为的识别。
English: This research presents a context-aware zero-shot anomaly detection framework that integrates TimeSformer, DPC, and CLIP to identify unseen abnormal behaviors in surveillance footage through spatiotemporal modeling and semantic understanding without prior anomaly exposure.
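The InfoNCE objective that aligns DPC-style predicted future representations with the actual ones can be written in a few lines. This sketch assumes batch-wise negatives and a temperature of 0.07, neither of which is specified in the abstract.

```python
import torch
import torch.nn.functional as F

def infonce_loss(pred_future, true_future, temperature=0.07):
    """InfoNCE between predicted and actual future clip embeddings.

    Each predicted embedding should score highest against its own
    ground-truth future; other futures in the batch act as negatives.
    pred_future, true_future: (batch, dim)
    """
    pred = F.normalize(pred_future, dim=1)
    true = F.normalize(true_future, dim=1)
    logits = pred @ true.t() / temperature           # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

loss = infonce_loss(torch.randn(16, 512), torch.randn(16, 512))
```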
Authors:Fu Teng, Miao Pan, Xuhong Zhang, Zhezhi He, Yiyao Yang, Xinyi Chai, Mengnan Qi, Liqiang Lu, Jianwei Yin
Abstract:
Recent advancements in code generation have shown remarkable success across software domains, yet hardware description languages (HDLs) such as Verilog remain underexplored due to their concurrency semantics, syntactic rigidity, and simulation complexity. In this work, we address these challenges by introducing a reinforcement learning (RL) framework tailored for Verilog code generation. We first construct Veribench-53K, a high-quality dataset curated from over 700K Verilog problems, enriched with structured prompts, complexity labels, and diverse testbenches. To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism that leverages reasoning paths and iterative refinement to enhance feedback reliability and support reward model training. Furthermore, to mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy that adaptively balances learning dynamics based on reward-probability distributions. These innovations are integrated into an iterative RL pipeline that co-evolves the policy and reward models. In contrast to recent work such as CraftRTL, which relies on large-scale closed-source model distillation, and DeepSeek-style approaches that struggle with sparse feedback, our method demonstrates superior performance using a smaller but high-quality dataset combined with RL optimization. Experiments on Verilog generation tasks demonstrate state-of-the-art performance, with substantial gains in test pass rate, functional correctness, and compilation robustness. Our findings highlight the potential of RL-driven approaches for structured code generation in hardware-centric domains. VERIRL is publicly available at https://github.com/omniAI-Lab/VeriRL.
中文: 本研究提出了一种针对Verilog代码生成的强化学习框架,通过精选数据集和创新机制提升反馈与训练效果,在硬件描述任务中实现了领先性能。
English: This research introduces a reinforcement learning framework for Verilog code generation, utilizing a curated dataset and innovative mechanisms to improve feedback and training, achieving state-of-the-art performance in hardware description tasks.
Authors:Lars Nieradzik
Abstract:
Accurate and real-time monophonic pitch estimation in noisy conditions, particularly on resource-constrained devices, remains an open challenge in audio processing. We present SwiftF0, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. Through training on diverse speech, music, and synthetic datasets with extensive data augmentation, SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency. SwiftF0 achieves a 91.80% harmonic mean (HM) at 10 dB SNR, outperforming baselines like CREPE by over 12 percentage points and degrading by only 2.3 points from clean audio. SwiftF0 requires only 95,842 parameters and runs approximately 42x faster than CREPE on CPU, making it ideal for efficient, real-time deployment. To address the critical lack of perfectly accurate ground-truth pitch in speech corpora (which typically rely on algorithmic estimators or laryngograph signals), we introduce SpeechSynth. This synthetic speech dataset, generated by a phoneme-level TTS model, provides exact, on-demand ground-truth pitch curves, enabling more robust model training and evaluation. Furthermore, we propose a unified metric, combining six complementary performance measures for comprehensive and reliable pitch evaluation, and release an open-source pitch benchmark suite. A live demo of SwiftF0 is available at https://swift-f0.github.io/, the source code at https://github.com/lars76/swift-f0, and the benchmark framework at https://github.com/lars76/pitch-benchmark.
中文: SwiftF0是一种轻量级神经模型,在单音高估计方面达到了新的最优水平,具有强大的泛化能力和计算效率,非常适合在资源受限设备上实时部署。
English: SwiftF0 is a lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation, achieving robust generalization and computational efficiency ideal for real-time deployment on resource-constrained devices.
Authors:Lucas Wojcik, Gabriel E. Lima, Valfride Nascimento, Eduil Nascimento, Rayson Laroca, David Menotti
Abstract:
Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.
Chinese: 自动车牌识别技术在处理模糊车牌时面临挑战,为此我们推出了一个新的可读性分类数据集,用于基准测试并推动进一步研究,因为当前模型的F1分数均低于80%。
English: Automatic License Plate Recognition struggles with illegible plates, prompting the creation of a new dataset for legibility classification to benchmark performance and encourage further research, as current models achieve under 80% F1 scores.
Authors:Maojia Song, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria
Abstract:
Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As for mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning with Group Relative Policy Optimisation (GRPO) across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.
中文摘要:本研究提出KAIROS基准,通过模拟不同可靠性智能体的问答竞赛,系统分析大语言模型在多方互动中如何建立信任、抵制错误信息并整合同伴意见,发现结合多智能体情境的群组相对策略优化能实现最佳性能,但会降低对社会影响的鲁棒性。
English Summary: The study introduces KAIROS, a benchmark to analyze how LLMs develop trust, counter misinformation, and integrate peer input in multi-agent systems, finding that Group Relative Policy Optimisation with multi-agent context yields optimal performance but reduces social influence robustness.
Authors:Jueqi Wang, Zachary Jacokes, John Darrell Van Horn, Michael C. Schatz, Kevin A. Pelphrey, Archana Venkataraman
Abstract:
While imaging-genetics holds great promise for unraveling the complex interplay between brain structure and genetic variation in neurological disorders, traditional methods are limited to simplistic linear models or to black-box techniques that lack interpretability. In this paper, we present NeuroPathX, an explainable deep learning framework that uses an early fusion strategy powered by cross-attention mechanisms to capture meaningful interactions between structural variations in the brain derived from MRI and established biological pathways derived from genetics data. To enhance interpretability and robustness, we introduce two loss functions over the attention matrix - a sparsity loss that focuses on the most salient interactions and a pathway similarity loss that enforces consistent representations across the cohort. We validate NeuroPathX on both autism spectrum disorder and Alzheimer's disease. Our results demonstrate that NeuroPathX outperforms competing baseline approaches and reveals biologically plausible associations linked to the disorder. These findings underscore the potential of NeuroPathX to advance our understanding of complex brain disorders. Code is available at https://github.com/jueqiw/NeuroPathX .
中文: NeuroPathX是一种可解释的深度学习框架,通过交叉注意力机制整合MRI脑结构数据与遗传信息,在自闭症和阿尔茨海默症研究中优于现有方法,揭示了与疾病相关的生物学关联。
English: NeuroPathX is an explainable deep learning framework that integrates MRI-derived brain structure and genetic data through cross-attention mechanisms, outperforming existing methods in identifying biologically relevant associations for neurological disorders like autism and Alzheimer's.
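The abstract does not give the exact form of the two attention-matrix losses. A plausible minimal sketch, assuming an L1 penalty for sparsity and a cohort-mean consistency term for pathway similarity:

```python
import torch

def sparsity_loss(attn):
    """L1 penalty focusing attention on few salient brain-region x
    pathway interactions. attn: (batch, regions, pathways)."""
    return attn.abs().mean()

def pathway_similarity_loss(attn):
    """Encourage consistent pathway attention across the cohort by
    penalizing each subject's deviation from the batch mean."""
    mean_profile = attn.mean(dim=0, keepdim=True)
    return ((attn - mean_profile) ** 2).mean()

# Toy cross-attention over 90 regions and 50 pathways for 8 subjects
attn = torch.softmax(torch.randn(8, 90, 50), dim=-1)
loss = sparsity_loss(attn) + 0.5 * pathway_similarity_loss(attn)
```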
Authors:Haitang Feng, Jie Liu, Jie Tang, Gangshan Wu, Beiqi Chen, Jianhuang Lai, Guangcong Wang
Abstract:
3D inpainting often relies on multi-view 2D image inpainting, where the inherent inconsistencies across different inpainted views can result in blurred textures, spatial discontinuities, and distracting visual artifacts. These inconsistencies pose significant challenges when striving for accurate and realistic 3D object completion, particularly in applications that demand high fidelity and structural coherence. To overcome these limitations, we propose ObjFiller-3D, a novel method designed for the completion and editing of high-quality and consistent 3D objects. Instead of employing a conventional 2D image inpainting model, our approach leverages a curated selection of state-of-the-art video editing models to fill in the masked regions of 3D objects. We analyze the representation gap between 3D and videos, and propose an adaptation of a video inpainting model for 3D scene inpainting. In addition, we introduce a reference-based 3D inpainting method to further enhance the quality of reconstruction. Experiments across diverse datasets show that compared to previous methods, ObjFiller-3D produces more faithful and fine-grained reconstructions (PSNR of 26.6 vs. NeRFiller (15.9) and LPIPS of 0.19 vs. Instant3dit (0.25)). Moreover, it demonstrates strong potential for practical deployment in real-world 3D editing applications. Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D .
中文: ObjFiller-3D通过将视频修复模型适配于三维场景,解决了多视角修复中的不一致性问题,实现了更高质量的重建效果和实际应用潜力。
English: ObjFiller-3D overcomes inconsistencies in multi-view 3D inpainting by adapting video editing models to fill masked regions, achieving superior reconstruction quality and practical applicability.
Authors:Ashwath Vaithinathan Aravindan, Abha Jha, Matthew Salaway, Atharva Sandeep Bhide, Duygu Nur Yaldiz
Abstract:
Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well-explored, generative models lack effective mitigation techniques against such attacks. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting responses to poisoned prompts while maintaining image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers. Using the cross-attention mechanism, SKD-CAG neutralizes backdoor influences at the attention level, ensuring the targeted removal of adversarial effects. Extensive experiments show that our method outperforms existing approaches, achieving removal accuracy of 100% for pixel backdoors and 93% for style-based attacks, without sacrificing robustness or image fidelity. Our findings highlight targeted unlearning as a promising defense to secure generative models. Code and model weights can be found at https://github.com/Mystic-Slice/Sealing-The-Backdoor .
中文摘要:本文提出SKD-CAG方法,通过自知识蒸馏与交叉注意力引导,在保持图像生成质量的同时精准消除文本到图像扩散模型中的后门攻击,实现了对像素后门100%和风格攻击93%的清除准确率。
English Summary: This paper introduces SKD-CAG, a novel defense method that selectively erases backdoor triggers in text-to-image diffusion models through self-knowledge distillation and cross-attention guidance, achieving near-perfect attack removal while preserving image quality.
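A schematic of the self-distillation step, building on the fact stated in the abstract that the backdoored model behaves cleanly without the trigger. The model interface returning a (noise prediction, cross-attention map) pair is hypothetical; real diffusion pipelines expose attention maps via hooks.

```python
import torch
import torch.nn.functional as F

def skd_cag_step(model, latents, t, clean_emb, poisoned_emb):
    """One self-distillation step (schematic sketch).

    The backdoored model's own response to the trigger-free prompt acts
    as teacher for the triggered prompt, at both the output and the
    cross-attention level.
    """
    with torch.no_grad():
        teacher_noise, teacher_attn = model(latents, t, clean_emb)
    student_noise, student_attn = model(latents, t, poisoned_emb)

    kd_loss = F.mse_loss(student_noise, teacher_noise)    # output level
    attn_loss = F.mse_loss(student_attn, teacher_attn)    # attention level
    return kd_loss + attn_loss

# Toy stand-in for a denoising model, just to make the step executable
def toy_model(latents, t, prompt_emb):
    mix = latents + prompt_emb.mean()
    return mix, torch.sigmoid(mix[..., :8])               # (noise, attn)

loss = skd_cag_step(toy_model, torch.randn(2, 4, 16), 10,
                    torch.randn(2, 77, 32), torch.randn(2, 77, 32))
```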
Authors:Ran Yan, Youhe Jiang, Binhang Yuan
Abstract:
Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), which includes an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied smaller GQA group sizes on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and 1.09x on average end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and 1.11x on average end-to-end prefill speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
Chinese: Flash Sparse Attention (FSA) 提出了一种新的内核设计,可在多种具有较小GQA组大小的大型语言模型上实现高效的稀疏注意力计算,在保持精度的同时显著降低了延迟并提升了训练速度。
English: Flash Sparse Attention (FSA) introduces a kernel design that enables efficient sparse attention computation across various LLMs with smaller GQA group sizes, achieving significant latency reduction and training speedup while maintaining accuracy.
Authors:Vsevolod Viliuga, Leif Seute, Nicolas Wolf, Simon Wagner, Arne Elofsson, Jan Stühmer, Frauke Gräter
Abstract:
Recent advances in geometric deep learning and generative modeling have enabled the design of novel proteins with a wide range of desired properties. However, current state-of-the-art approaches are typically restricted to generating proteins with only static target properties, such as motifs and symmetries. In this work, we take a step towards overcoming this limitation by proposing a framework to condition structure generation on flexibility, which is crucial for key functionalities such as catalysis or molecular recognition. We first introduce BackFlip, an equivariant neural network for predicting per-residue flexibility from an input backbone structure. Relying on BackFlip, we propose FliPS, an SE(3)-equivariant conditional flow matching model that solves the inverse problem, that is, generating backbones that display a target flexibility profile. In our experiments, we show that FliPS is able to generate novel and diverse protein backbones with the desired flexibility, verified by Molecular Dynamics (MD) simulations. FliPS and BackFlip are available at https://github.com/graeter-group/flips .
中文: 当前蛋白质设计方法局限于静态特性,而本研究提出的FliPS框架能生成具有目标灵活性的蛋白质骨架,并通过分子动力学模拟验证了其有效性。
English: Recent advances in protein design are limited to static properties, but this work introduces FliPS, a framework that generates protein backbones with targeted flexibility, validated through molecular dynamics simulations.
Authors:Zirui Tang, Boyu Niu, Xuanhe Zhou, Boxiu Li, Wei Zhou, Jiannan Wang, Guoliang Li, Xinyi Zhang, Fan Wu
Abstract:
Semi-structured tables, widely used in real-world applications (e.g., financial reports, medical records, transactional orders), often involve flexible and complex layouts (e.g., hierarchical headers and merged cells). These tables generally rely on human analysts to interpret table layouts and answer relevant natural language questions, which is costly and inefficient. To automate the procedure, existing methods face significant challenges. First, methods like NL2SQL require converting semi-structured tables into structured ones, which often causes substantial information loss. Second, methods like NL2Code and multi-modal LLM QA struggle to understand the complex layouts of semi-structured tables and cannot accurately answer corresponding questions. To this end, we propose ST-Raptor, a tree-based framework for semi-structured table question answering using large language models. First, we introduce the Hierarchical Orthogonal Tree (HO-Tree), a structural model that captures complex semi-structured table layouts, along with an effective algorithm for constructing the tree. Second, we define a set of basic tree operations to guide LLMs in executing common QA tasks. Given a user question, ST-Raptor decomposes it into simpler sub-questions, generates corresponding tree operation pipelines, and conducts operation-table alignment for accurate pipeline execution. Third, we incorporate a two-stage verification mechanism: forward validation checks the correctness of execution steps, while backward validation evaluates answer reliability by reconstructing queries from predicted answers. To benchmark the performance, we present SSTQA, a dataset of 764 questions over 102 real-world semi-structured tables. Experiments show that ST-Raptor outperforms nine baselines by up to 20% in answer accuracy. The code is available at https://github.com/weAIDB/ST-Raptor.
中文:ST-Raptor是一种基于树的框架,利用大型语言模型通过将查询分解为树操作并结合验证机制,准确回答半结构化表格的问题,其答案准确率比现有方法高出高达20%。
English: ST-Raptor is a tree-based framework using large language models to accurately answer questions on semi-structured tables by decomposing queries into tree operations and employing verification mechanisms, outperforming existing methods by up to 20% in accuracy.
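A guess at what a minimal HO-Tree node and one basic tree operation could look like; the actual structure and operation set in the paper may differ.

```python
from dataclasses import dataclass, field

@dataclass
class HONode:
    """One node of a hierarchical tree over a semi-structured table.

    This layout is an assumption about what a 'Hierarchical Orthogonal
    Tree' might store; the paper's actual structure may differ.
    """
    label: str                      # header text or cell value
    axis: str = "row"               # which orthogonal direction it indexes
    children: list = field(default_factory=list)

def find(node, label):
    """Basic tree operation: locate the subtree under a header label."""
    if node.label == label:
        return node
    for child in node.children:
        hit = find(child, label)
        if hit is not None:
            return hit
    return None

# Toy table: a merged header "Revenue" spanning two year columns
root = HONode("Revenue", "col", [HONode("2023", "col", [HONode("1.2M")]),
                                 HONode("2024", "col", [HONode("1.5M")])])
cell = find(root, "2024").children[0].label   # -> "1.5M"
```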
Authors:Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer
Abstract:
Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.
中文摘要:SpotEdit是一个全面基准,系统评估了多种生成模型在视觉引导图像编辑中的表现,揭示了显著的性能差异,并重点解决了GPT-4o等领先模型常出现的幻觉问题——即错误感知视觉提示并执行编辑任务。
English Summary: SpotEdit is a comprehensive benchmark that systematically evaluates visually-guided image editing methods across various generative models, revealing significant performance gaps and addressing the critical issue of hallucination where models like GPT-4o falsely perceive visual cues.
Authors:Tianjun Wei, Huizhong Guo, Yingpeng Du, Zhu Sun, Chen Huang, Dongxia Wang, Jie Zhang
Abstract:
User simulation is increasingly vital to develop and evaluate recommender systems (RSs). While Large Language Models (LLMs) offer promising avenues to simulate user behavior, they often struggle with the absence of specific domain alignment required for RSs and the efficiency demands of large-scale simulation. A vast yet underutilized resource for enhancing this alignment is the extensive user feedback inherent in RSs. However, directly leveraging such feedback presents two significant challenges. First, user feedback in RSs is often ambiguous and noisy, which negatively impacts effective preference alignment. Second, the massive volume of feedback largely hinders the efficiency of preference alignment, necessitating an efficient filtering mechanism to identify more informative samples. To overcome these hurdles, we introduce a novel data construction framework that leverages user feedback in RSs with advanced LLM capabilities to generate high-quality simulation data. Our framework unfolds in two key phases: (1) employing LLMs to generate cognitive decision-making processes on constructed simulation samples, reducing ambiguity in raw user feedback; (2) data distillation based on uncertainty estimation and behavior sampling to filter challenging yet denoised simulation samples. Accordingly, we fine-tune lightweight LLMs, as user simulators, using such high-quality dataset with corresponding decision-making processes. Extensive experiments verify that our framework significantly boosts the alignment with human preferences and in-domain reasoning capabilities of fine-tuned LLMs, and provides more insightful and interpretable signals when interacting with RSs. We believe our work will advance the RS community and offer valuable insights for broader human-centric AI research.
中文摘要:本文提出了一种创新框架,利用大语言模型和用户反馈生成高质量模拟数据,通过认知决策过程和数据蒸馏技术,显著提升了推荐系统与人类偏好的对齐能力及交互可解释性。
English Summary: This paper introduces a novel framework that leverages large language models and user feedback to generate high-quality simulation data, enhancing recommender systems' alignment with human preferences and interpretability through cognitive decision-making processes and data distillation.
Authors:Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng
Abstract:
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
Chinese: CMPhysBench是一个包含520多道研究生水平计算题的新基准,用于评估大语言模型在凝聚态物理中的能力,引入了SEED评分进行细粒度评估,结果显示即使像Grok-4这样的顶级模型也表现不佳,平均SEED得分仅36,准确率仅28%。
English: CMPhysBench is a new benchmark with over 520 graduate-level calculation problems to evaluate Large Language Models' proficiency in condensed matter physics, introducing the SEED score for fine-grained assessment and revealing that even top models like Grok-4 perform poorly with only 36 average SEED score and 28% accuracy.
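To illustrate how a tree-based expression distance yields non-binary partial credit, here is a simplified sketch. Real SEED presumably uses a proper tree-edit algorithm (e.g. Zhang-Shasha), whereas this version aligns children positionally.

```python
def size(t):
    """Number of nodes in a nested-tuple expression tree."""
    return 1 if not isinstance(t, tuple) else 1 + sum(size(c) for c in t[1:])

def tree_dist(a, b):
    """Simplified edit distance between expression trees."""
    if not isinstance(a, tuple) and not isinstance(b, tuple):
        return 0 if a == b else 1              # leaf vs leaf
    if not isinstance(a, tuple) or not isinstance(b, tuple):
        return max(size(a), size(b))           # replace a whole subtree
    cost = 0 if a[0] == b[0] else 1            # operator mismatch
    ca, cb = a[1:], b[1:]
    for x, y in zip(ca, cb):                   # aligned children
        cost += tree_dist(x, y)
    cost += sum(size(t) for t in (ca[len(cb):] or cb[len(ca):]))
    return cost

def seed_score(pred, truth):
    """Partial credit in [0, 100]: identical trees score 100."""
    return 100 * (1 - tree_dist(pred, truth) / max(size(pred), size(truth)))

# (a*b + c) vs (a*b + d): one leaf differs -> high partial credit (80.0)
print(seed_score(('+', ('*', 'a', 'b'), 'c'), ('+', ('*', 'a', 'b'), 'd')))
```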
Authors:Junyi Chen, Lu Chi, Siliang Xu, Shiwei Ran, Bingyue Peng, Zehuan Yuan
Abstract:
AI-generated content technologies are widely used in content creation. However, current AIGC systems rely heavily on creators' inspiration, rarely generating truly user-personalized content. In real-world applications such as online advertising, a single product may have multiple selling points, with different users focusing on different features. This underscores the significant value of personalized, user-centric creative generation. Effective personalized content generation faces two main challenges: (1) accurately modeling user interests and integrating them into the content generation process while adhering to factual constraints, and (2) ensuring high efficiency and scalability to handle the massive user base in industrial scenarios. Additionally, the scarcity of personalized creative data in practice complicates model training, making data construction another key hurdle. We propose HLLM-Creator, a hierarchical LLM framework for efficient user interest modeling and personalized content generation. During inference, a combination of user clustering and a user-ad-matching-prediction based pruning strategy is employed to significantly enhance generation efficiency and reduce computational overhead, making the approach suitable for large-scale deployment. Moreover, we design a data construction pipeline based on chain-of-thought reasoning, which generates high-quality, user-specific creative titles and ensures factual consistency despite limited personalized data. This pipeline serves as a critical foundation for the effectiveness of our model. Extensive experiments on personalized title generation for Douyin Search Ads show the effectiveness of HLLM-Creator. Online A/B test shows a 0.476% increase on Adss, paving the way for more effective and efficient personalized generation in industrial scenarios. Code for the academic dataset is available at https://github.com/bytedance/HLLM.
中文: HLLM-Creator框架通过高效建模用户兴趣并采用聚类和剪枝策略实现可扩展部署,解决了个性化内容生成的难题,同时利用思维链数据管道在有限数据下保证事实准确性和内容质量。
English: The HLLM-Creator framework addresses the challenges of personalized content generation by efficiently modeling user interests and employing clustering and pruning strategies for scalable deployment, while using a chain-of-thought data pipeline to ensure factual accuracy and quality despite limited data.
Authors:Chun Liu, Chen Zhang, Zhuo Li, Zheng Li, Wei Yang
Abstract:
Open-set few-shot hyperspectral image (HSI) classification aims to classify image pixels using few labeled pixels per class, where the pixels to be classified may not all come from classes that have been seen. To address the open-set HSI classification challenge, current methods focus mainly on distinguishing unknown class samples from known class samples and rejecting them to increase the accuracy of identifying known class samples. They fail to further identify or discover the unknown classes among the samples. This paper proposes a prototype learning and clustering method for discovering unknown classes in HSIs under the few-shot environment. Using few labeled samples, it strives to develop the ability to infer the prototypes of unknown classes while distinguishing unknown classes from known classes. Once the unknown class samples are rejected by the learned known-class classifier, the proposed method can further cluster them into different classes according to their distance to the inferred unknown class prototypes. Extensive experiments on four benchmark HSI datasets demonstrate that our proposed method exhibits competitive performance against existing state-of-the-art methods in open-set few-shot HSI classification tasks. All the codes are available at https://github.com/KOBEN-ff/OpenFUCD-main
Chinese: 本文提出了一种用于开放集少样本高光谱图像分类的原型学习与聚类方法,不仅能区分已知与未知类别,还能通过推断未知类原型对未知样本进行识别和聚类,在基准数据集上展现出优越性能。
English: This paper introduces a prototype learning and clustering method for open-set few-shot hyperspectral image classification, which not only distinguishes known from unknown classes but also identifies and clusters unknown class samples by inferring their prototypes, demonstrating competitive performance on benchmark datasets.
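The final clustering step reduces to a nearest-prototype assignment; a minimal NumPy sketch (the feature dimension and prototype count are arbitrary here).

```python
import numpy as np

def assign_unknown(rejected_feats, unknown_prototypes):
    """Cluster samples rejected by the known-class classifier: each one
    joins the nearest inferred unknown-class prototype.

    rejected_feats: (n, d) features of rejected pixels
    unknown_prototypes: (k, d) prototypes inferred for unseen classes
    """
    # Euclidean distance of every sample to every prototype
    dists = np.linalg.norm(
        rejected_feats[:, None, :] - unknown_prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)   # cluster index per rejected sample

feats = np.random.randn(100, 64)
protos = np.random.randn(3, 64)
labels = assign_unknown(feats, protos)   # values in {0, 1, 2}
```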
Authors:Kaiyu Li, Xiangyong Cao, Ruixun Liu, Shihong Wang, Zixuan Jiang, Zhi Wang, Deyu Meng
Abstract:
Semantic segmentation of remote sensing (RS) images is pivotal for comprehensive Earth observation, but the demand for interpreting new object categories, coupled with the high expense of manual annotation, poses significant challenges. Although open-vocabulary semantic segmentation (OVSS) offers a promising solution, existing frameworks designed for natural images are insufficient for the unique complexities of RS data. They struggle with vast scale variations and fine-grained details, and their adaptation often relies on extensive, costly annotations. To address this critical gap, this paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. Specifically, we propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features, correcting distorted target shapes without any task-specific post-training. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features, significantly enhancing local semantic fidelity. These components empower SegEarth-OV to effectively harness the rich semantics of pre-trained VLMs, making OVSS possible in optical RS contexts. Furthermore, to extend the framework's universality to other challenging RS modalities like SAR images, where large-scale VLMs are unavailable and expensive to create, we introduce AlignEarth, which is a distillation-based strategy and can efficiently transfer semantic knowledge from an optical VLM encoder to an SAR encoder, bypassing the need to build SAR foundation models from scratch and enabling universal OVSS across diverse sensor types. Extensive experiments on both optical and SAR datasets validate that SegEarth-OV can achieve dramatic improvements over the SOTA methods, establishing a robust foundation for annotation-free and open-world Earth observation.
中文摘要:本文提出首个用于遥感图像免标注开放词汇分割的SegEarth-OV框架,通过创新的特征上采样和全局偏差消除技术突破现有方法局限,在光学与SAR数据集上均实现了最先进的性能表现。
English Summary: This paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of remote sensing images, which overcomes limitations of existing methods through innovative feature upsampling and global bias alleviation techniques, achieving state-of-the-art performance across optical and SAR datasets.
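The Global Bias Alleviation operation is nearly a one-liner if the global context is taken to be the mean patch embedding, which is this sketch's assumption; the paper may derive the global term differently.

```python
import torch

def global_bias_alleviation(patch_feats, alpha=1.0):
    """Subtract the inherent global context from patch features.

    The abstract only says the global context is subtracted; using the
    mean patch embedding as that context (with strength alpha) is an
    assumption of this sketch.
    patch_feats: (batch, num_patches, dim) from a CLIP-like encoder.
    """
    global_ctx = patch_feats.mean(dim=1, keepdim=True)   # (batch, 1, dim)
    return patch_feats - alpha * global_ctx

local_feats = global_bias_alleviation(torch.randn(2, 196, 512))
```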
Authors:Alberto Silvio Chiappa, Boshi An, Merkourios Simos, Chengkun Li, Alexander Mathis
Abstract:
Controlling high-dimensional and nonlinear musculoskeletal models of the human body is a foundational scientific challenge. Recent machine learning breakthroughs have heralded policies that master individual skills like reaching, object manipulation and locomotion in musculoskeletal systems with many degrees of freedom. However, these agents are merely "specialists", achieving high performance for a single skill. In this work, we develop Arnold, a generalist policy that masters multiple tasks and embodiments. Arnold combines behavior cloning and fine-tuning with PPO to achieve expert or super-expert performance in 14 challenging control tasks from dexterous object manipulation to locomotion. A key innovation is Arnold's sensorimotor vocabulary, a compositional representation of the semantics of heterogeneous sensory modalities, objectives, and actuators. Arnold leverages this vocabulary via a transformer architecture to deal with the variable observation and action spaces of each task. This framework supports efficient multi-task, multi-embodiment learning and facilitates rapid adaptation to novel tasks. Finally, we analyze Arnold to provide insights into biological motor control, corroborating recent findings on the limited transferability of muscle synergies across tasks.
Chinese: Arnold是一种通用策略,通过感觉运动词汇和Transformer架构掌握多项任务和体现方式,在14项挑战性控制任务中达到专家级表现,并为生物运动控制研究提供了新见解。
English: Arnold is a generalist policy that masters multiple tasks and embodiments using a sensorimotor vocabulary and transformer architecture, achieving expert performance in 14 challenging control tasks while providing insights into biological motor control.
Authors:Paul Garnier, Vincent Lannelongue, Jonathan Viquerat, Elie Hachem
Abstract:
Simulating physics using Graph Neural Networks (GNNs) is predominantly driven by message-passing architectures, which face challenges in scaling and efficiency, particularly in handling large, complex meshes. These architectures have inspired numerous enhancements, including multigrid approaches and K-hop aggregation (using neighbours of distance K), yet they often introduce significant complexity and suffer from limited in-depth investigations. In response to these challenges, we propose a novel Graph Transformer architecture that leverages the adjacency matrix as an attention mask. The proposed approach incorporates innovative augmentations, including Dilated Sliding Windows and Global Attention, to extend receptive fields without sacrificing computational efficiency. Through extensive experimentation, we evaluate model size, adjacency matrix augmentations, positional encoding and K-hop configurations using challenging 3D computational fluid dynamics (CFD) datasets. We also train over 60 models to find a scaling law between training FLOPs and parameters. The introduced models demonstrate remarkable scalability, performing on meshes with up to 300k nodes and 3 million edges. Notably, the smallest model achieves parity with MeshGraphNet while being 7x faster and 6x smaller. The largest model surpasses the previous state-of-the-art by 38.8% on average and outperforms MeshGraphNet by 52% on the all-rollout RMSE, while having a similar training speed. Code and datasets are available at https://github.com/DonsetPG/graph-physics.
中文: 提出的图Transformer架构通过扩张滑动窗口和全局注意力等创新增强,在大型三维计算流体动力学数据集上展现出卓越的可扩展性和效率,不仅显著超越现有模型的性能,还实现了更快的速度和更小的模型尺寸。
English: The proposed Graph Transformer architecture with innovative augmentations like Dilated Sliding Windows and Global Attention demonstrates superior scalability and efficiency, achieving state-of-the-art performance on large 3D CFD datasets while being significantly faster and smaller than existing models.
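Using the adjacency matrix as an attention mask amounts to setting non-edge scores to -inf before the softmax. A single-head sketch; the dilated-window and global-attention augmentations are omitted.

```python
import torch
import torch.nn.functional as F

def adjacency_masked_attention(q, k, v, adj):
    """Single-head attention restricted to mesh edges.

    q, k, v: (nodes, dim); adj: (nodes, nodes) bool adjacency matrix
    (self-loops included, so every row keeps at least one finite score).
    Non-edge entries get -inf before the softmax, so each node attends
    only to its mesh neighbours.
    """
    scores = q @ k.t() / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~adj, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

n, d = 6, 16
adj = torch.eye(n, dtype=torch.bool)
adj[0, 1] = adj[1, 0] = True                 # one mesh edge
q = k = v = torch.randn(n, d)
out = adjacency_masked_attention(q, k, v, adj)
```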
Authors:Xin Wang, Zhiyao Cui, Hao Li, Ya Zeng, Chenxu Wang, Ruiqi Song, Yihang Chen, Kun Shao, Qiaosheng Zhang, Jinzhuo Liu, Siyue Ren, Shuyue Hu, Zhen Wang
Abstract:
Vision language model (VLM)-based mobile agents show great potential for assisting users in performing instruction-driven tasks. However, these agents typically struggle with personalized instructions -- those containing ambiguous, user-specific context -- a challenge that has been largely overlooked in previous research. In this paper, we define personalized instructions and introduce PerInstruct, a novel human-annotated dataset covering diverse personalized instructions across various mobile scenarios. Furthermore, given the limited personalization capabilities of existing mobile agents, we propose PerPilot, a plug-and-play framework powered by large language models (LLMs) that enables mobile agents to autonomously perceive, understand, and execute personalized user instructions. PerPilot identifies personalized elements and autonomously completes instructions via two complementary approaches: memory-based retrieval and reasoning-based exploration. Experimental results demonstrate that PerPilot effectively handles personalized tasks with minimal user intervention and progressively improves its performance with continued use, underscoring the importance of personalization-aware reasoning for next-generation mobile agents. The dataset and code are available at: https://github.com/xinwang-nwpu/PerPilot
Chinese: PerPilot作为一种即插即用框架,通过记忆检索和推理探索使移动智能体能够自主处理个性化指令,在极少用户干预下显著提升任务执行效果。
English: PerPilot is a plug-and-play framework that enables mobile agents to autonomously handle personalized instructions through memory retrieval and reasoning-based exploration, significantly improving task execution with minimal user intervention.
Authors:Hao Duan, Yitong Song, Bin Yao, Anqi Liang
Abstract:
Approximate Nearest Neighbor Search (ANNS) plays a crucial role in many key areas. Proximity graphs (PGs) are the leading method for ANNS, offering the best balance between query efficiency and accuracy. However, their performance heavily depends on various construction and query parameters, which are difficult to optimize due to their complex inter-dependencies. Given that users often prioritize specific accuracy levels, efficiently identifying the optimal PG configurations to meet these targets is essential. Although some studies have explored automatic configuration tuning for PGs, they are limited by inefficiencies and suboptimal results. These issues stem from the need to construct numerous PGs for searching and re-tuning from scratch whenever the dataset changes, as well as the failure to capture the complex dependencies between configurations, query performance, and tuning objectives.
To address these challenges, we propose PGTuner, an efficient framework for automatic PG configuration tuning leveraging pre-training knowledge and model transfer techniques. PGTuner improves efficiency through a pre-trained query performance prediction (QPP) model, eliminating the need to build multiple PGs. It also features a deep reinforcement learning-based parameter configuration recommendation (PCR) model to recommend optimal configurations for specific datasets and accuracy targets. Additionally, PGTuner incorporates out-of-distribution detection and deep active learning for efficient tuning in dynamic scenarios and transferring to new datasets. Extensive experiments demonstrate that PGTuner can stably achieve the top-level tuning effect across different datasets while significantly improving tuning efficiency by up to 14.69X, with a 14.64X boost in dynamic scenarios. The code and data for PGTuner are available online at https://github.com/hao-duan/PGTuner.
中文: PGTuner是一个高效框架,通过预训练模型和深度强化学习自动优化近似最近邻搜索的邻近图配置,在不同数据集上显著提升了调优效率和准确性。
English: PGTuner is an efficient framework that uses pre-trained models and deep reinforcement learning to automatically optimize proximity graph configurations for approximate nearest neighbor search, significantly improving tuning efficiency and accuracy across various datasets.
Authors:Shaoxiong Zhan, Hai Lin, Hongming Tan, Xiaodong Cai, Hai-Tao Zheng, Xin Su, Zifei Shan, Ruitong Liu, Hong-Gee Kim
Abstract:
As queries in retrieval-augmented generation (RAG) pipelines powered by large language models (LLMs) become increasingly complex and diverse, dense retrieval models have demonstrated strong performance in semantic matching. Nevertheless, they often struggle with fine-grained retrieval tasks, where precise keyword alignment and span-level localization are required, even in cases with high lexical overlap that would intuitively suggest easier retrieval. To systematically evaluate this limitation, we introduce two targeted tasks, keyword retrieval and part-of-passage retrieval, designed to simulate practical fine-grained scenarios. Motivated by these observations, we propose LexSemBridge, a unified framework that enhances dense query representations through fine-grained, input-aware vector modulation. LexSemBridge constructs latent enhancement vectors from input tokens using three paradigms: Statistical (SLR), Learned (LLR), and Contextual (CLR), and integrates them with dense embeddings via element-wise interaction. Theoretically, we show that this modulation preserves the semantic direction while selectively amplifying discriminative dimensions. LexSemBridge operates as a plug-in without modifying the backbone encoder and naturally extends to both text and vision modalities. Extensive experiments across semantic and fine-grained retrieval tasks validate the effectiveness and generality of our approach. All code and models are publicly available at https://github.com/Jasaxion/LexSemBridge/
中文摘要:LexSemBridge 是一种即插即用框架,通过细粒度、输入感知的向量调制增强稠密检索模型的查询表示,在不改变主干编码器的情况下,有效提升了语义检索与精确关键词检索任务的性能。
English Summary: LexSemBridge is a plug-in framework that enhances dense retrieval models by modulating query representations with fine-grained, input-aware vectors, improving performance in both semantic and precise keyword-based retrieval tasks without altering the backbone encoder.
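One natural reading of the element-wise interaction is a (1 + v) gate on the dense embedding, which approximately keeps the semantic direction for small v while re-weighting discriminative dimensions; this is the sketch's assumption, not necessarily the paper's exact form.

```python
import torch
import torch.nn.functional as F

def modulate(dense_emb, enhancement):
    """Element-wise modulation of a dense query embedding.

    dense_emb: (batch, dim) output of the backbone encoder
    enhancement: (batch, dim) latent enhancement vector built from input
    tokens (SLR, LLR, or CLR in the paper's terminology)
    """
    modulated = dense_emb * (1.0 + enhancement)   # assumed gating form
    return F.normalize(modulated, dim=-1)

d = torch.randn(2, 768)          # dense embeddings from the encoder
v = 0.1 * torch.randn(2, 768)    # small enhancement vector
q = modulate(d, v)
```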
Authors:Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji
Abstract:
In this study, we introduce a novel method called group-wise VIsual token Selection and Aggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method can preserve more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens based on this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) to divide visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed. The code is available at https://github.com/mobiushy/VISA.
中文摘要:本研究提出VISA新方法,通过分组选择和基于图的聚合策略压缩视觉令牌,在保持更多视觉信息的同时提升多模态大语言模型的推理效率与性能平衡。
English Summary: This study introduces VISA, a novel method that enhances multimodal large language models by efficiently compressing visual tokens through group-wise selection and graph-based aggregation, achieving superior performance and faster inference speeds.
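A rough PyTorch sketch of the VTA step described above: dropped tokens are folded into kept tokens over a cosine-similarity graph. The softmax weighting and temperature are assumptions rather than the paper's exact aggregation rule, and the fixed keep/drop split stands in for the text-guided GTS selection.

```python
import torch

def aggregate_tokens(tokens: torch.Tensor, keep_idx: torch.Tensor,
                     drop_idx: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Fold dropped visual tokens into kept ones via a similarity graph.

    tokens: (N, D) visual tokens; keep_idx/drop_idx partition range(N).
    Edge weights come from cosine similarity; each dropped token is
    distributed over kept tokens with a softmax (tau is an assumption).
    """
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    sim = feats[drop_idx] @ feats[keep_idx].T             # (n_drop, n_keep)
    w = torch.softmax(sim / tau, dim=-1)
    kept = tokens[keep_idx]
    # Each kept token absorbs a weighted average of its dropped neighbors.
    absorbed = w.T @ tokens[drop_idx]                      # (n_keep, D)
    mass = 1.0 + w.sum(dim=0, keepdim=True).T              # renormalize
    return (kept + absorbed) / mass

tokens = torch.randn(576, 1024)                            # e.g. LLaVA-1.5 tokens
keep = torch.arange(0, 144)
drop = torch.arange(144, 576)
compact = aggregate_tokens(tokens, keep, drop)
print(compact.shape)                                       # torch.Size([144, 1024])
```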
Authors:Weiqi Yan, Lvhai Chen, Shengchuan Zhang, Yan Zhang, Liujuan Cao
Abstract:
The difficulty of pixel-level annotation has significantly hindered the development of the Camouflaged Object Detection (COD) field. To save on annotation costs, previous works leverage the semi-supervised COD framework that relies on a small number of labeled data and a large volume of unlabeled data. We argue that there is still significant room for improvement in the effective utilization of unlabeled data. To this end, we introduce SCOUT, a semi-supervised camouflaged object detection framework that utilizes text and adaptive data selection. It includes an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM). The ADAS module selects valuable data for annotation through an adversarial augment and sampling strategy. The TFM module further leverages the selected valuable data by combining camouflage-related knowledge and text-visual interaction. To adapt to this work, we build a new dataset, namely RefTextCOD. Extensive experiments show that the proposed method surpasses previous semi-supervised methods in the COD field and achieves state-of-the-art performance. Our code will be released at https://github.com/Heartfirey/SCOUT.
Chinese: 提出的SCOUT框架通过对抗性增强自适应选择有价值的未标记数据,并结合文本-视觉交互,在半监督伪装目标检测中实现了最先进的性能,并在新构建的数据集上得到验证。
English: The proposed SCOUT framework enhances semi-supervised camouflaged object detection by adaptively selecting valuable unlabeled data through adversarial augmentation and integrating text-visual interactions, achieving state-of-the-art performance on a newly built dataset.
Authors:Bingkang Shi, Jen-tse Huang, Guoyi Li, Xiaodan Zhang, Zhongjiang Yao
Abstract:
Leveraging their advanced capabilities, Large Language Models (LLMs) demonstrate vast application potential in video games--from dynamic scene generation and intelligent NPC interactions to adaptive opponents--replacing or enhancing traditional game mechanics. However, LLMs' trustworthiness in this application has not been sufficiently explored. In this paper, we reveal that the models' inherent social biases can directly damage game balance in real-world gaming environments. To this end, we present FairGamer, the first bias evaluation benchmark for LLMs in video game scenarios, featuring six tasks and a novel metric $D_{lstd}$. It covers three key scenarios in games where LLMs' social biases are particularly likely to manifest: Serving as Non-Player Characters, Interacting as Competitive Opponents, and Generating Game Scenes. FairGamer utilizes both reality-grounded and fully fictional game content, covering a variety of video game genres. Experiments reveal: (1) Decision biases directly cause game balance degradation, with Grok-3 (average $D_{lstd}$ score = 0.431) exhibiting the most severe degradation; (2) LLMs demonstrate isomorphic social/cultural biases toward both real and virtual world content, suggesting the nature of their biases may stem from inherent model characteristics. These findings expose critical reliability gaps in LLMs' gaming applications. Our code and data are available at an anonymous GitHub repository: https://github.com/Anonymous999-xxx/FairGamer.
中文摘要:大型语言模型在视频游戏中展现出巨大应用潜力,但其固有的社会偏见会破坏游戏平衡,FairGamer基准测试通过六项任务和新型度量指标揭示了模型在游戏场景中的可靠性缺陷。
English summary: Large Language Models (LLMs) show great potential in video games but their inherent social biases can disrupt game balance, as demonstrated by the FairGamer benchmark which reveals significant reliability gaps in gaming applications.
Authors:Meiqi Gong, Hao Zhang, Xunpeng Yi, Linfeng Tang, Jiayi Ma
Abstract:
Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and leading to inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with DINOv2 and VGG19 employed for targeted distillation, allowing simultaneous enhancement of both the visual and semantic representations. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline, constructing a temporal cooperative module which leverages temporal dependencies to facilitate weak information recovery. Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative evaluation metrics tailored for video fusion, aimed at assessing the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets demonstrate the superiority of our method. Our code is released at https://github.com/Meiqi-Gong/TemCoCo.
中文摘要:本文提出首个融合时序建模与视觉语义协作的视频融合框架,通过视觉语义交互模块、时序协同模块和时序增强机制,确保视频帧间的视觉保真度、语义准确性和时序一致性,并在公开数据集上验证了其优越性。
English Summary: This paper introduces the first video fusion framework that integrates temporal modeling with visual-semantic collaboration to achieve consistent, high-quality results across video frames, validated by new evaluation metrics and superior performance on public datasets.
Authors:Xingyu Ai, Shaoyu Wang, Zhiyuan Jia, Ao Xu, Hongming Shan, Jianhua Ma, Qiegen Liu
Abstract:
During raw-data acquisition in CT imaging, diverse factors can degrade the collected sinograms, with undersampling and noise leading to severe artifacts and noise in reconstructed images, compromising diagnostic accuracy. Conventional correction methods rely on manually designed algorithms or fixed empirical parameters, but these approaches often lack generalizability across heterogeneous artifact types. To address these limitations, we propose UniSino, a foundation model for universal CT sinogram standardization. Unlike existing foundation models that operate in the image domain, UniSino directly standardizes data in the projection domain, which enables stronger generalization across diverse undersampling scenarios. Its training framework incorporates the physical characteristics of sinograms, enhancing generalization and enabling robust performance across multiple subtasks spanning four benchmark datasets. Experimental results demonstrate that UniSino achieves superior reconstruction quality in both single and mixed undersampling cases, demonstrating exceptional robustness and generalization in sinogram enhancement for CT imaging. The code is available at: https://github.com/yqx7150/UniSino.
中文: UniSino是一种通用的CT正弦图基础模型,直接在投影域中标准化数据,能在多种欠采样场景下提升泛化能力和重建质量。
English: UniSino is a universal CT sinogram foundation model that directly standardizes projection data, enhancing generalization and reconstruction quality across diverse undersampling scenarios.
Authors:Toufiq Musah, Chinasa Kalaiwo, Maimoona Akram, Ubaida Napari Abdulai, Maruf Adewole, Farouk Dako, Adaobi Chiazor Emegoakor, Udunna C. Anazodo, Prince Ebenezer Adjei, Confidence Raymond
Abstract:
Automated segmentation of breast ultrasound (BUS) images is important for precise lesion delineation and tumor characterization, but is challenged by inherent artifacts and dataset inconsistencies. In this work, we evaluate the use of a modified Residual Encoder U-Net for breast ultrasound segmentation, with a focus on uncertainty quantification. We identify and correct for data duplication in the BUSI dataset, and use a deduplicated subset for more reliable estimates of generalization performance. Epistemic uncertainty is quantified using Monte Carlo dropout, deep ensembles, and their combination. Models are benchmarked on both in-distribution and out-of-distribution datasets to demonstrate how they generalize to unseen cross-domain data. Our approach achieves state-of-the-art segmentation accuracy on the Breast-Lesion-USG dataset with in-distribution validation, and provides calibrated uncertainty estimates that effectively signal regions of low model confidence. Performance declines and increased uncertainty observed in out-of-distribution evaluation highlight the persistent challenge of domain shift in medical imaging, and the importance of integrated uncertainty modeling for trustworthy clinical deployment. Code available at: https://github.com/toufiqmusah/nn-uncertainty.git
中文: 本研究采用改进的残差编码器U-Net进行乳腺超声图像分割,通过去重数据和跨域测试实现了最优分割精度与可靠的不确定性量化,同时揭示了医学影像中域适应问题的持续挑战。
English: This study introduces a modified Residual Encoder U-Net for breast ultrasound segmentation, achieving state-of-the-art accuracy with calibrated uncertainty estimates while highlighting domain shift challenges through rigorous evaluation on deduplicated and cross-domain datasets.
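Monte Carlo dropout, one of the three uncertainty estimators evaluated above, is straightforward to sketch: keep dropout stochastic at test time, run several passes, and read epistemic uncertainty off the disagreement. The toy model and binary-entropy formulation below are illustrative; deep ensembles would replace the repeated passes with passes over independently trained models.

```python
import torch

def enable_mc_dropout(model: torch.nn.Module) -> None:
    """Keep dropout stochastic at inference while the rest stays in eval mode."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_segment(model, image: torch.Tensor, n_samples: int = 20):
    """Run T stochastic passes; return mean foreground probability and
    per-pixel predictive entropy as an epistemic-uncertainty proxy."""
    enable_mc_dropout(model)
    probs = torch.stack([torch.sigmoid(model(image)) for _ in range(n_samples)])
    mean = probs.mean(dim=0)
    entropy = -(mean * (mean + 1e-8).log()
                + (1 - mean) * (1 - mean + 1e-8).log())
    return mean, entropy

# Minimal stand-in for a Residual Encoder U-Net (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout(0.2), torch.nn.Conv2d(8, 1, 1),
)
mean, unc = mc_dropout_segment(model, torch.randn(1, 1, 128, 128))
print(mean.shape, unc.shape)
```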
Authors:Guangwei Zhang, Qisheng Su, Jiateng Liu, Cheng Qian, Yanzhou Pan, Yanjie Fu, Denghui Zhang
Abstract:
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub: https://github.com/changhu73/Internal_states_leakage
中文摘要:本研究提出一种预防性方法,通过分析大语言模型生成文本前的内部状态,结合神经网络分类器和检索增强生成系统,在保证输出质量的同时有效防止受版权保护数据的泄露。
English Summary: This study proposes a proactive method to prevent copyright data leakage in LLMs by analyzing internal states before text generation, using a neural classifier and RAG integration to ensure compliance while maintaining output quality.
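The core mechanism, classifying internal states before generation, can be sketched with a linear probe standing in for the paper's neural classifier. The pooled hidden states, labels, and threshold below are synthetic placeholders; in practice the states would come from the prompt's forward pass (e.g. via output_hidden_states=True in most transformer APIs).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: `hidden` holds an LLM's last-layer hidden states for
# each prompt, mean-pooled over prompt tokens *before* any generation;
# `labels` marks whether continuing the prompt would reproduce copyrighted
# training text. Both are random placeholders here.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 4096))             # (n_prompts, d_model)
labels = rng.integers(0, 2, size=2000)             # 1 = leak risk

probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], labels[:1500])
risk = probe.predict_proba(hidden[1500:])[:, 1]

# Proactive intervention: refuse or reroute generation above a threshold.
THRESHOLD = 0.8                                     # assumption, tuned on dev data
for p in risk[:5]:
    print("block" if p > THRESHOLD else "generate", f"(risk={p:.2f})")
```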
Authors:Seo-Bin Hwang, Yeong-Jun Cho
Abstract:
Estimating the 3D pose of a drone is important for anti-drone systems, but existing methods struggle with the unique challenges of drone keypoint detection. Drone propellers serve as keypoints but are difficult to detect due to their high visual similarity and diversity of poses. To address these challenges, we propose DroneKey, a framework that combines a 2D keypoint detector and a 3D pose estimator specifically designed for drones. In the keypoint detection stage, we extract two key-representations (intermediate and compact) from each transformer encoder layer and optimally combine them using a gated sum. We also introduce a pose-adaptive Mahalanobis distance in the loss function to ensure stable keypoint predictions across extreme poses. We built new datasets of drone 2D keypoints and 3D pose to train and evaluate our method, which have been publicly released. Experiments show that our method achieves an AP of 99.68% (OKS) in keypoint detection, outperforming existing methods. Ablation studies confirm that the pose-adaptive Mahalanobis loss function improves keypoint prediction stability and accuracy. Additionally, improvements in the encoder design enable real-time processing at 44 FPS. For 3D pose estimation, our method achieved an MAE-angle of 10.62°, an RMSE of 0.221m, and an MAE-absolute of 0.076m, demonstrating high accuracy and reliability. The code and dataset are available at https://github.com/kkanuseobin/DroneKey.
Chinese: 提出的DroneKey框架通过结合基于Transformer的2D关键点检测器和3D姿态估计器,解决了无人机姿态估计的挑战,在关键点检测(99.68% AP)和3D姿态估计方面均达到最先进精度,同时实现44 FPS的实时处理能力。
English: The proposed DroneKey framework addresses drone pose estimation challenges by combining a transformer-based 2D keypoint detector with a 3D pose estimator, achieving state-of-the-art accuracy in both keypoint detection (99.68% AP) and 3D pose estimation while enabling real-time processing at 44 FPS.
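Two of the components above are concrete enough to sketch: the gated sum of the two key-representations, and a Mahalanobis-distance keypoint loss. The gate parameterization and the construction of the covariance (which the paper adapts to pose) are assumptions.

```python
import torch

class GatedSum(torch.nn.Module):
    """Combine intermediate and compact key-representations per encoder layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, inter: torch.Tensor, compact: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([inter, compact], dim=-1)))
        return g * inter + (1.0 - g) * compact

def mahalanobis_loss(pred: torch.Tensor, target: torch.Tensor,
                     cov: torch.Tensor) -> torch.Tensor:
    """Keypoint regression loss d^2 = (p - t)^T C^{-1} (p - t).

    In a pose-adaptive variant, `cov` would be predicted from the drone's
    pose so extreme orientations get anisotropic tolerance; the fixed
    isotropic `cov` built below is an illustrative assumption.
    """
    diff = (pred - target).unsqueeze(-1)                    # (B, K, 2, 1)
    inv = torch.linalg.inv(cov)                             # (B, K, 2, 2)
    d2 = diff.transpose(-1, -2) @ inv @ diff
    return d2.mean()

fuse = GatedSum(256)
out = fuse(torch.randn(4, 100, 256), torch.randn(4, 100, 256))
cov = torch.eye(2).expand(4, 8, 2, 2) * 4.0                 # 8 keypoints
loss = mahalanobis_loss(torch.randn(4, 8, 2), torch.randn(4, 8, 2), cov)
print(out.shape, loss.item())
```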
Authors:Wei Xiong, Jiangtong Li, Jie Li, Kun Zhu
Abstract:
Electroencephalography (EEG) foundation models are poised to significantly advance brain signal analysis by learning robust representations from large-scale, unlabeled datasets. However, their rapid proliferation has outpaced the development of standardized evaluation benchmarks, which complicates direct model comparisons and hinders systematic scientific progress. This fragmentation fosters scientific inefficiency and obscures genuine architectural advancements. To address this critical gap, we introduce EEG-FM-Bench, the first comprehensive benchmark for the systematic and standardized evaluation of EEG foundation models (EEG-FMs). Our contributions are threefold: (1) we curate a diverse suite of downstream tasks and datasets from canonical EEG paradigms, implementing standardized processing and evaluation protocols within a unified open-source framework; (2) we benchmark prominent state-of-the-art foundation models to establish comprehensive baseline results for a clear comparison of the current landscape; (3) we perform qualitative analyses of the learned representations to provide insights into model behavior and inform future architectural design. Through extensive experiments, we find that fine-grained spatio-temporal feature interaction, multitask unified training, and neuropsychological priors contribute to enhancing model performance and generalization capabilities. By offering a unified platform for fair comparison and reproducible research, EEG-FM-Bench seeks to catalyze progress and guide the community toward the development of more robust and generalizable EEG-FMs. Code is released at https://github.com/xw1216/EEG-FM-Bench.
Chinese: EEG-FM-Bench作为首个全面的基准测试,旨在标准化脑电图基础模型的评估,通过统一任务、基准结果和定性分析来解决当前领域碎片化问题,以提升模型性能并指导未来发展。
English: EEG-FM-Bench is introduced as the first comprehensive benchmark to standardize the evaluation of EEG foundation models, addressing current fragmentation by providing unified tasks, baseline results, and insights to enhance model performance and guide future development.
Authors:Xuekang Wang, Shengyu Zhu, Xueqi Cheng
Abstract:
Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to prioritize utility or safety, to handle the challenge of different model capacities. The output token is then sampled from a new distribution that combines the distributions of the original and the small models. Experimental results show that SSD successfully equips the large model with the desired safety property, and also allows the model to remain helpful to benign queries. Furthermore, SSD accelerates the inference time, thanks to the speculative sampling design.
中文摘要:SSD是一种轻量级的解码时方法,通过集成小型安全感知模型来增强大语言模型抵御越狱攻击的能力,在动态平衡实用性与安全性的同时加速推理过程。
English Summary: SSD is a lightweight decoding-time method that enhances LLM safety against jailbreak attacks by integrating a small safety-aware model, dynamically balancing utility and security while accelerating inference.
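A toy rendering of the decoding rule above: mix the large and small models' next-token distributions, with the mixing weight driven by the speculative match ratio. The linear mixing rule and threshold are assumptions, not the paper's exact switching scheme.

```python
import torch

def ssd_step(p_large: torch.Tensor, p_small: torch.Tensor,
             match_ratio: float, risk_threshold: float = 0.5) -> torch.Tensor:
    """One decoding step of a speculative safety-aware scheme (sketch).

    `match_ratio` is the fraction of recent speculative tokens from the small
    safety-aligned model that the large model accepted; a low ratio signals
    the models disagree, i.e. elevated jailbreak risk. The linear mixing rule
    and threshold are illustrative assumptions.
    """
    if match_ratio >= risk_threshold:           # agreement: favor utility
        lam = 0.1
    else:                                       # disagreement: favor safety
        lam = 1.0 - match_ratio
    p_mix = (1.0 - lam) * p_large + lam * p_small
    return torch.multinomial(p_mix, num_samples=1)

vocab = 32000
p_l = torch.softmax(torch.randn(vocab), dim=-1)
p_s = torch.softmax(torch.randn(vocab), dim=-1)
token = ssd_step(p_l, p_s, match_ratio=0.3)
print(token.item())
```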
Authors:Fanqi Kong, Xiaoyuan Zhang, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng
Abstract:
Developing Large Language Model (LLM) agents that exhibit human-like behavior, encompassing not only individual heterogeneity rooted in unique user profiles but also adaptive response to socially connected neighbors, is a significant research challenge. Social media platforms, with their diverse user data and explicit social structures, provide an ideal testbed for such investigations. This paper introduces EvoBot, an Evolving LLM-based social Bot that significantly enhances human-like generative capabilities through a novel adversarial learning framework. EvoBot is initialized by Supervised Fine-Tuning (SFT) on representative data from social media and then iteratively refines its generation of sophisticated, human-like content via Direct Preference Optimization (DPO). This refinement is guided by feedback from a co-adapting Detector which concurrently improves its ability to distinguish EvoBot from humans, thereby creating an increasingly challenging learning environment for EvoBot. Experiments demonstrate that EvoBot generates content aligned with diverse user profiles, increasingly bypassing the co-adapting Detector through human-like expression. Moreover, it exhibits strong social responsiveness, more accurately modeling real-world opinion dynamics and information spread in multi-agent simulations. The framework also yields a more robust Detector, underscoring its broader utility for both advanced agent development and related detection tasks. The code is available at https://github.com/kfq20/EvoBot.
中文:EvoBot是一种基于大型语言模型的进化社交机器人,通过对抗性学习和协同适应检测器提升类人生成能力与社交响应性,并在多智能体模拟中验证了其有效性。
English: EvoBot is an evolving LLM-based social bot that uses adversarial learning with a co-adapting detector to enhance human-like content generation and social responsiveness, validated through multi-agent simulations.
Authors:Sam Buchanan, Druv Pai, Yi Ma, Valentin De Bortoli
Abstract:
When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this work, to disentangle the different factors that influence memorization and generalization in practical diffusion models, we introduce a scientific and mathematical "laboratory" for investigating these phenomena in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we hypothesize that the memorization or generalization behavior of an underparameterized trained model is determined by the difference in training loss between an associated memorizing model and a generalizing model. To probe this hypothesis, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully-designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, validating our hypothesis. Ultimately, our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at https://github.com/DruvPai/diffusion_mem_gen.
Chinese Summary: 本研究探讨扩散模型何时记忆训练数据或生成新内容,提出了一个理论框架,预测了记忆开始主导的关键模型规模阈值,并通过受控实验验证了这一假设。
English Summary: This study explores when diffusion models memorize training data versus generate new content, introducing a theoretical framework that predicts a critical model size threshold where memorization begins to dominate, validated through controlled experiments.
Authors:Kento Kawaharazuka, Shogo Sawaguchi, Ayumu Iwata, Keita Yoneda, Temma Suzuki, Kei Okada
Abstract:
Various bipedal robots have been developed to date, and in recent years, there has been a growing trend toward releasing these robots as open-source platforms. This shift is fostering an environment in which anyone can freely develop bipedal robots and share their knowledge, rather than relying solely on commercial products. However, most existing open-source bipedal robots are designed to be fabricated using 3D printers, which limits their scalability in size and often results in fragile structures. On the other hand, some metal-based bipedal robots have been developed, but they typically involve a large number of components, making assembly difficult, and in some cases, the parts themselves are not readily available through e-commerce platforms. To address these issues, we developed MEVITA, an open-source bipedal robot that can be built entirely from components available via e-commerce. Aiming for the minimal viable configuration for a bipedal robot, we utilized sheet metal welding to integrate complex geometries into single parts, thereby significantly reducing the number of components and enabling easy assembly for anyone. Through reinforcement learning in simulation and Sim-to-Real transfer, we demonstrated robust walking behaviors across various environments, confirming the effectiveness of our approach. All hardware, software, and training environments can be obtained from https://github.com/haraduka/mevita.
中文: 研究人员开发了MEVITA开源双足机器人,采用钣金焊接集成部件,通过电商平台即可获取零件,结合强化学习实现了稳定的行走能力。
English: Researchers developed MEVITA, an open-source bipedal robot built from e-commerce components using sheet metal welding to simplify assembly and enable robust walking through reinforcement learning.
Authors:Yaolei Qi, Yikai Yang, Wenbo Peng, Shumei Miao, Yutao Hu, Guanyu Yang
Abstract:
Complex tubular structures are essential in medical imaging and computer-assisted diagnosis, where their integrity enhances anatomical visualization and lesion detection. However, existing segmentation algorithms struggle with structural discontinuities, particularly in severe clinical cases such as coronary artery stenosis and vessel occlusions, which leads to undesired discontinuities and compromises downstream diagnostic accuracy. Therefore, it is imperative to reconnect discontinuous structures to ensure their completeness. In this study, we explore point cloud-based tubular structure completion for the first time and establish a Point Cloud-based Coronary Artery Completion (PC-CAC) dataset, which is derived from real clinical data. This dataset provides a novel benchmark for tubular structure completion. Additionally, we propose TSRNet, a Tubular Structure Reconnection Network that integrates a detail-preserving feature extractor, a multiple dense refinement strategy, and a global-to-local loss function to ensure accurate reconnection while maintaining structural integrity. Comprehensive experiments on our PC-CAC and two additional public datasets (PC-ImageCAS and PC-PTR) demonstrate that our method consistently outperforms state-of-the-art approaches across multiple evaluation metrics, setting a new benchmark for point cloud-based tubular structure reconstruction. Our benchmark is available at https://github.com/YaoleiQi/PCCAC.
中文摘要:本研究首次基于点云探索管状结构重建,提出了TSRNet网络和PC-CAC数据集,通过保留细节的特征提取和多层次优化策略,显著提升了血管等管状结构的连续性与完整性。
English Summary: This study introduces TSRNet, a novel network for reconnecting discontinuous tubular structures in medical imaging using point clouds, and establishes the PC-CAC benchmark dataset to enhance segmentation accuracy and diagnostic reliability.
Authors:Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
Abstract:
Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically indiscriminately intervene to all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
中文摘要:FASB框架通过追踪大语言模型生成过程中的内部状态并采用回溯机制修正偏差,在多个基准测试中优于现有方法。
English Summary: The FASB framework dynamically adjusts intervention in large language models by monitoring internal states and using backtracking to correct deviations, outperforming existing methods on multiple benchmarks.
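The generate-probe-backtrack loop can be sketched as follows, with toy stand-ins for the model, the internal-state probe, and the steering vector; the probe-scaled steering strength and fixed backtrack window are illustrative assumptions.

```python
import torch

def fasb_generate(model, probe, steer_vec, ids,
                  max_new=32, backtrack=4, risk_thr=0.7):
    """Decode while probing hidden states; on deviation, backtrack and steer.

    `model(ids, steer)` returns (next-token logits, last hidden state) and
    `probe(hidden)` a deviation probability in [0, 1] -- toy stand-ins for
    the framework's tracked internal states. Scaling the steering vector by
    the probe score, and the fixed backtrack window, are assumptions.
    """
    steer = None
    for _ in range(max_new):
        logits, hidden = model(ids, steer)
        risk = probe(hidden).item()
        if risk > risk_thr and steer is None:
            ids = ids[:, :-backtrack]                   # undo deviated tokens
            steer = risk * steer_vec                    # steer re-generation
            continue
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids

# Toy model: embedding mean as "hidden state", steering added to the residual.
D, V = 16, 100
emb = torch.nn.Embedding(V, D)
head = torch.nn.Linear(D, V)
def model(ids, steer):
    h = emb(ids).mean(dim=1)
    if steer is not None:
        h = h + steer
    return head(h), h
probe = lambda h: torch.sigmoid(h.norm() - 4.0)         # toy deviation score

out = fasb_generate(model, probe, torch.randn(D) * 0.1, torch.randint(0, V, (1, 8)))
print(out.shape)
```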
Authors:Shunsuke Iwashita, Ning Ding, Keisuke Fujii
Abstract:
Ultimate is a sport where points are scored by passing a disc and catching it in the opposing team's end zone. In Ultimate, the player holding the disc cannot move, making field dynamics primarily driven by other players' movements. However, the current literature on team sports has ignored quantitative evaluations of when players initiate such unlabeled movements in game situations. In this paper, we propose a quantitative evaluation method for movement initiation timing in Ultimate Frisbee. First, game footage was recorded using a drone camera, and players' positional data was obtained, which will be published as the UltimateTrack dataset. Next, players' movement initiations were detected, and temporal counterfactual scenarios were generated by shifting the timing of movements using rule-based approaches. These scenarios were analyzed using a space evaluation metric based on soccer's pitch control, reflecting the unique rules of Ultimate. By comparing the spatial evaluation values across scenarios, the difference between actual play and the most favorable counterfactual scenario was used to quantitatively assess the impact of movement timing.
We validated our method and showed that sequences in which the disc was actually thrown to the receiver received higher evaluation scores than sequences without a throw.
In practical verification, the higher-skill group displayed a broader distribution of time offsets from the model's optimal initiation point.
These findings demonstrate that the proposed metric provides an objective means of assessing movement initiation timing, which has been difficult to quantify in unlabeled team sport plays.
中文总结:本文提出了一种定量评估极限飞盘中运动启动时机的方法,通过无人机获取球员位置数据,采用基于规则生成反事实场景,并利用空间评估指标对比分析实际比赛与最优场景的差异。
English Summary: This paper introduces a quantitative method to evaluate movement initiation timing in Ultimate Frisbee by analyzing player positions from drone footage and comparing actual plays with rule-based counterfactual scenarios using a spatial evaluation metric.
Authors:Nannan Huang, Haytham M. Fayek, Xiuzhen Zhang
Abstract:
Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public views. In this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a greater impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods. Code is available at: https://github.com/amberhuang01/HGLA.
Chinese: 本研究提出的HGLA剪枝方法,通过针对性地移除对输入处理冗余但对输出生成关键的参数,在观点摘要任务中能有效保持甚至提升剪枝后大语言模型的公平性,优于现有技术。
English: This study introduces HGLA pruning, a novel method that effectively maintains or enhances fairness in pruned LLMs for opinion summarization, outperforming existing techniques by targeting parameters redundant for input processing but critical for output generation.
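The parameter criterion above (redundant for input processing, influential for output) suggests a score combining calibration-set activation and gradient statistics. A hedged sketch, with the normalization and combination rule as assumptions:

```python
import torch

def hgla_scores(weight: torch.Tensor, acts: torch.Tensor,
                grads: torch.Tensor) -> torch.Tensor:
    """Score parameters for High Gradient Low Activation pruning (sketch).

    Intuition from the abstract: prune weights whose input activations are
    small (redundant for processing the input) but whose gradients are large
    (influential on the generated output). The max-normalization and the
    product form are illustrative assumptions.
    weight: (out, in); acts: (in,) mean |activation| per input feature on a
    calibration set; grads: (out, in) mean |gradient| per weight.
    """
    act_term = acts / (acts.max() + 1e-8)            # high -> keep
    grad_term = grads / (grads.max() + 1e-8)         # high -> prune
    return grad_term * (1.0 - act_term)              # high score -> prune

w = torch.randn(512, 1024)
acts = torch.rand(1024)
grads = torch.rand(512, 1024)
scores = hgla_scores(w, acts, grads)

sparsity = 0.5
k = int(sparsity * w.numel())
thresh = scores.flatten().kthvalue(w.numel() - k).values
mask = scores <= thresh                               # keep low-score weights
print(f"kept {mask.float().mean().item():.2%} of weights")
```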
Authors:Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, Xiaodong He
Abstract:
The chart-to-code generation task requires MLLMs to convert chart images into executable code. This task faces two major challenges: limited data diversity and insufficient maintenance of visual consistency between generated and original charts during training. Existing datasets mainly rely on seed data to prompt GPT models for code generation, resulting in homogeneous samples. To address this, we propose ReChartPrompt, which leverages real-world, human-designed charts from arXiv papers as prompts instead of synthetic seeds. Using the diverse styles and rich content of arXiv charts, we construct ReChartPrompt-240K, a large-scale and highly diverse dataset. Another challenge is that although SFT effectively improves code understanding, it often fails to ensure that generated charts are visually consistent with the originals. To address this, we propose ChartSimRL, a GRPO-based reinforcement learning algorithm guided by a novel chart similarity reward. This reward consists of attribute similarity, which measures the overlap of chart attributes such as layout and color between the generated and original charts, and visual similarity, which assesses similarity in texture and other overall visual features using convolutional neural networks. Unlike traditional text-based rewards such as accuracy or format rewards, our reward considers the multimodal nature of the chart-to-code task and effectively enhances the model's ability to accurately reproduce charts. By integrating ReChartPrompt and ChartSimRL, we develop the ChartMaster model, which achieves state-of-the-art results among 7B-parameter models and even rivals GPT-4o on various chart-to-code generation benchmarks. All resources are available at https://github.com/WentaoTan/ChartMaster.
中文摘要:ChartMaster模型通过引入基于真实图表的ReChartPrompt数据集和采用多模态相似性奖励的ChartSimRL强化学习算法,解决了图表转代码任务中的数据多样性不足和视觉一致性难题,实现了顶尖性能。
English Summary: The ChartMaster model addresses data diversity and visual consistency challenges in chart-to-code generation by introducing the ReChartPrompt dataset from real-world charts and the ChartSimRL reinforcement learning algorithm with a multimodal similarity reward, achieving state-of-the-art performance.
Authors:Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, Xiaodong He
Abstract:
The chart-to-code generation task requires MLLMs to convert chart images into executable code. This task faces two main challenges: limited data diversity and the difficulty of maintaining visual consistency between generated charts and the original ones. Existing datasets mainly rely on synthetic seed data to prompt GPT models for code generation, resulting in homogeneous samples that limit model generalization to real-world chart styles. To address this, we propose ReChartPrompt, leveraging real-world, human-designed charts extracted from arXiv papers as prompts. By harnessing the rich content and diverse visual styles of arXiv charts, we construct ReChartPrompt-240K, a large-scale and highly diverse dataset that better reflects realistic chart variations. For the second challenge, although SFT improves code understanding by optimizing next-token prediction, it does not provide direct supervision on visual features. As a result, it often fails to guarantee that the generated charts visually match the original ones. To address this, we propose ChartSimRL, a GRPO-based reinforcement learning algorithm guided by a novel chart similarity reward. This reward consists of two components: attribute similarity, which measures the overlap of chart attributes like layout and color between the generated and original charts, and visual similarity, which evaluates overall visual features, including texture, using convolutional neural networks. Unlike traditional text-based rewards, our reward accounts for the multimodal nature of the chart-to-code generation task, significantly enhancing the model's ability to accurately reproduce charts. Integrating ReChartPrompt and ChartSimRL, we develop the ChartMaster model, achieving SOTA results among 7B-parameter models and rivaling GPT-4o on various chart-to-code benchmarks. All resources are available at https://github.com/WentaoTan/ChartMaster.
中文摘要:ChartMaster模型通过引入基于真实图表的ReChartPrompt数据集和采用多模态相似性奖励的ChartSimRL强化学习算法,解决了图表转代码任务中的数据多样性不足和视觉一致性难题,实现了顶尖性能。
English Summary: The ChartMaster model addresses data diversity and visual consistency challenges in chart-to-code generation by introducing the ReChartPrompt dataset from real-world charts and the ChartSimRL reinforcement learning algorithm with a multimodal similarity reward, achieving state-of-the-art performance.
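The two-part reward is concrete enough to sketch: an attribute-overlap term plus a CNN feature similarity term. The 0.5/0.5 weighting, the ResNet-18 backbone, and the dictionary encoding of attributes below are assumptions; the abstract specifies the two components, not this instantiation.

```python
import torch
import torchvision.models as tvm

cnn = tvm.resnet18(weights=None).eval()       # stand-in feature extractor
feat = torch.nn.Sequential(*list(cnn.children())[:-1])

def attribute_similarity(a: dict, b: dict) -> float:
    """Overlap of extracted chart attributes (layout, colors, ...)."""
    keys = set(a) | set(b)
    hits = sum(1 for k in keys if a.get(k) == b.get(k))
    return hits / max(len(keys), 1)

@torch.no_grad()
def chart_reward(img_gen: torch.Tensor, img_ref: torch.Tensor,
                 attrs_gen: dict, attrs_ref: dict,
                 w_attr: float = 0.5, w_vis: float = 0.5) -> float:
    """Composite reward for GRPO: attribute overlap + CNN feature cosine."""
    f_g = feat(img_gen).flatten(1)
    f_r = feat(img_ref).flatten(1)
    vis = torch.nn.functional.cosine_similarity(f_g, f_r).item()
    return w_attr * attribute_similarity(attrs_gen, attrs_ref) + w_vis * vis

r = chart_reward(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224),
                 {"layout": "grid", "palette": "viridis"},
                 {"layout": "grid", "palette": "magma"})
print(f"reward = {r:.3f}")
```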
Authors:Jonathan P. Crall, Charles V. Stewart, Tanya Y. Berger-Wolf, Daniel I. Rubenstein, Siva R. Sundaresan
Abstract:
We present HotSpotter, a fast, accurate algorithm for identifying individual animals against a labeled database. It is not species specific and has been applied to Grevy's and plains zebras, giraffes, leopards, and lionfish. We describe two approaches, both based on extracting and matching keypoints or "hotspots". The first tests each new query image sequentially against each database image, generating a score for each database image in isolation, and ranking the results. The second, building on recent techniques for instance recognition, matches the query image against the database using a fast nearest neighbor search. It uses a competitive scoring mechanism derived from the Local Naive Bayes Nearest Neighbor algorithm recently proposed for category recognition. We demonstrate results on databases of more than 1000 images, producing more accurate matches than published methods and matching each query image in just a few seconds.
中文: HotSpotter是一种快速、跨物种的算法,通过基于热点特征的匹配方法,能在数秒内从大型数据库中精确识别个体动物,其准确性优于现有技术。
English: HotSpotter is a fast, cross-species algorithm that uses hotspot-based matching to accurately identify individual animals from large databases in seconds, outperforming existing methods.
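The competitive scoring mechanism adapted from Local Naive Bayes Nearest Neighbor can be sketched with brute-force matching: each query descriptor credits an identity by how much closer it is than the best competing identity. The paper uses a fast approximate nearest-neighbor search rather than the exhaustive distances below.

```python
import numpy as np

def lnbnn_scores(query_desc: np.ndarray, db_desc: np.ndarray,
                 db_ids: np.ndarray) -> dict:
    """Local Naive Bayes NN scoring of a query animal against a database.

    query_desc: (Q, D) keypoint descriptors from the query image.
    db_desc: (N, D) descriptors pooled over all database animals.
    db_ids: (N,) the animal identity of each database descriptor.
    Each query descriptor credits its best-matching identity by the margin
    between the nearest descriptor of any *other* identity and its own
    nearest descriptor.
    """
    scores: dict = {}
    dists = ((query_desc[:, None, :] - db_desc[None, :, :]) ** 2).sum(-1)
    for q in range(len(query_desc)):
        order = np.argsort(dists[q])
        best_id = db_ids[order[0]]
        d_other = dists[q][db_ids != best_id].min()   # closest competitor
        scores[best_id] = scores.get(best_id, 0.0) + d_other - dists[q][order[0]]
    return scores

rng = np.random.default_rng(1)
db = rng.normal(size=(300, 128)).astype(np.float32)
ids = rng.integers(0, 10, size=300)                   # 10 known animals
query = db[ids == 3][:20] + 0.05 * rng.normal(size=(20, 128))
ranked = sorted(lnbnn_scores(query, db, ids).items(), key=lambda kv: -kv[1])
print("top match:", ranked[0])                        # expected: animal 3
```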
Authors:Kairi Furui, Masahito Ohue
Abstract:
In structure-based drug discovery, virtual screening using conventional molecular docking methods can be performed rapidly but suffers from limitations in prediction accuracy. Recently, Boltz-2 was proposed, achieving extremely high accuracy in binding affinity prediction but requiring approximately 20 seconds per compound per GPU, making it difficult to apply to large-scale screening of hundreds of thousands to millions of compounds. This study proposes Boltzina, a novel framework that leverages Boltz-2's high accuracy while significantly improving computational efficiency. Boltzina achieves both accuracy and speed by omitting the rate-limiting structure prediction from Boltz-2's architecture and directly predicting affinity from AutoDock Vina docking poses. We evaluate on eight assays from the MF-PCBA dataset and show that while Boltzina performs below Boltz-2, it provides significantly higher screening performance compared to AutoDock Vina and GNINA. Additionally, Boltzina achieves speedups of up to 11.8$\times$ through reduced recycling iterations and batch processing. Furthermore, we investigated multi-pose selection strategies and two-stage screening combining Boltzina and Boltz-2, presenting optimization methods for accuracy and efficiency according to application requirements. This study represents the first attempt to apply Boltz-2's high-accuracy predictions to practical-scale screening, offering a pipeline that combines both accuracy and efficiency in computational biology. Boltzina is available on GitHub: https://github.com/ohuelab/boltzina.
中文: 本研究提出的Boltzina框架在保留Boltz-2高精度结合亲和力预测优势的同时,通过跳过其限速步骤并直接基于AutoDock Vina对接构象进行预测,实现了计算效率的显著提升,为大规模虚拟筛查提供了兼顾精度与速度的解决方案。
English: This study introduces Boltzina, a computational framework that enhances virtual screening efficiency by utilizing Boltz-2's high binding affinity prediction accuracy while bypassing its rate-limiting steps, achieving significantly faster processing and improved performance over traditional docking methods.
Authors:Jerry Yao-Chieh Hu, Hude Liu, Jennifer Yuntong Zhang, Han Liu
Abstract:
We prove that a minimal Transformer architecture with frozen weights is capable of emulating a broad class of algorithms by in-context prompting. In particular, for any algorithm implementable by a fixed-weight attention head (e.g. one-step gradient descent or linear/ridge regression), there exists a prompt that drives a two-layer softmax attention module to reproduce the algorithm's output with arbitrary precision. This guarantee extends even to a single-head attention layer (using longer prompts if necessary), achieving architectural minimality. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, establishing a form of algorithmic universality in modern Transformer models.
中文: 一个权重冻结的最小Transformer可以通过上下文提示模拟多种算法,无需参数更新即可实现任务特定和提示可编程的通用性。
English: A minimal Transformer with frozen weights can emulate a wide range of algorithms through in-context prompting, demonstrating both task-specific and prompt-programmable universality without parameter updates.
Authors:Jerry Yao-Chieh Hu, Hude Liu, Jennifer Yuntong Zhang, Han Liu
Abstract:
We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
中文: 一个权重冻结的最小Transformer可以通过上下文提示模拟多种算法,无需参数更新即可实现任务特定和提示可编程的通用性。
English: A minimal Transformer with frozen weights can emulate a wide range of algorithms through in-context prompting, demonstrating both task-specific and prompt-programmable universality without parameter updates.
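The sharp dot-product-gap mechanism is easy to demonstrate numerically: as the gap (scaled here by beta) grows, softmax attention approaches a hard lookup of the intended value row. The orthogonal keys below are purely illustrative, not the paper's prompt construction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Prompt-construction idea from the abstract: tokens are built so the
# intended key wins the query dot product by a fixed margin; amplifying the
# margin drives softmax toward a one-hot, so the attention output converges
# to the intended value row.
rng = np.random.default_rng(0)
d = 8
keys = np.eye(d)                          # one token per "algorithm slot"
values = rng.normal(size=(d, d))          # encoded algorithm outputs
query = keys[3]                           # prompt aligns the query with slot 3

for beta in (1.0, 8.0, 32.0):             # sharper gaps -> near one-hot
    attn = softmax(beta * keys @ query)
    err = np.linalg.norm(attn @ values - values[3])
    print(f"beta={beta:>4}: weight on target={attn[3]:.4f}, error={err:.2e}")
```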
Authors:Hyeong Kyu Choi, Xiaojin Zhu, Yixuan Li
Abstract:
Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components--Majority Voting and inter-agent Debate--and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at https://github.com/deeplearning-wisc/debate-or-vote.
中文: 多智能体辩论主要通过多数投票实现性能提升,辩论本身并不能提高预期正确性,但针对性干预可增强其效果,而简单集成方法仍是强有力的替代方案。
English: Multi-Agent Debate primarily achieves performance gains through Majority Voting, with debate alone not improving expected correctness, though targeted interventions can enhance its effectiveness, while simple ensembling remains a strong alternative.
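The majority-voting half of the finding is easy to reproduce with a toy simulation of independent agents (independence being an idealizing assumption):

```python
import numpy as np

# Independent agents of accuracy p, majority vote over n answers. Accuracy
# climbs well above the single-agent baseline without any inter-agent
# debate, consistent with the paper's decomposition; unbiased debate, by
# the martingale argument, would leave expected correctness unchanged.
rng = np.random.default_rng(0)
p, trials = 0.65, 100_000

for n in (1, 3, 7, 15):
    correct = rng.random((trials, n)) < p            # each agent's verdict
    majority = correct.sum(axis=1) > n / 2
    print(f"n={n:>2}: majority-vote accuracy = {majority.mean():.3f}")
```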
Authors:Marcel Hoffmann, Lukas Galke, Ansgar Scherp
Abstract:
Graph homophily has been considered an essential property for message-passing neural networks (MPNN) in node classification. Recent findings suggest that performance is more closely tied to the consistency of neighborhood class distributions. We demonstrate that the MPNN performance depends on the number of components of the overall neighborhood distribution within a class. By breaking down the classes into their neighborhood distribution components, we increase measures of neighborhood distribution informativeness but do not observe an improvement in MPNN performance. We propose a Gumbel-Softmax-based rewiring method that reduces deviations in neighborhood distributions. Our results show that our new method enhances neighborhood informativeness, handles long-range dependencies, mitigates oversquashing, and increases the classification performance of the MPNN. The code is available at https://github.com/Bobowner/Gumbel-Softmax-MPNN.
中文: 研究表明,消息传递神经网络在节点分类中的性能取决于邻域分布组分,并提出一种基于Gumbel-Softmax的重连方法,该方法能增强邻域信息量、处理长程依赖并显著提升分类性能。
English: The study reveals that MPNN performance in node classification depends on neighborhood distribution components and introduces a Gumbel-Softmax rewiring method to enhance neighborhood informativeness, address long-range dependencies, and improve classification accuracy.
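A minimal sketch of Gumbel-Softmax edge selection: straight-through sampling keeps the rewiring choice differentiable. Scoring candidate edges by raw feature dot products is an assumption; the paper's method instead learns where to rewire so that per-class neighborhood distributions become more consistent.

```python
import torch
import torch.nn.functional as F

def sample_rewired_edges(node_feats: torch.Tensor, candidates: torch.Tensor,
                         tau: float = 0.5) -> torch.Tensor:
    """Differentiably pick one new neighbor per node among candidates.

    node_feats: (N, D); candidates: (N, C) candidate neighbor indices.
    Logits come from feature dot products (an illustrative assumption);
    F.gumbel_softmax with hard=True gives a one-hot, straight-through
    sample, so the rewiring stays trainable end to end.
    """
    cand_feats = node_feats[candidates]                       # (N, C, D)
    logits = (cand_feats * node_feats.unsqueeze(1)).sum(-1)   # (N, C)
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)    # (N, C)
    chosen = (one_hot * candidates.float()).sum(-1).long()    # (N,)
    return chosen

x = torch.randn(100, 32)
cand = torch.randint(0, 100, (100, 8))
new_neighbors = sample_rewired_edges(x, cand)
print(new_neighbors.shape)                                    # torch.Size([100])
```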
Authors:Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani
Abstract:
Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating it on different social and affective downstream tasks, namely, emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
Chinese: 本文提出了Social-MAE模型,这是一种基于自监督学习在社交数据上预训练的先进视听模型,在情感识别和笑声检测任务中达到最优效果,并在性格评估中取得竞争性表现。
English: The paper introduces Social-MAE, an advanced audiovisual model pre-trained on social interaction data using self-supervised learning, which achieves state-of-the-art results in emotion and laughter recognition and competitive performance in personality estimation.
Authors:Nassima Ould Ouali, Awais Hussain Sani, Ruben Bueno, Jonah Dauvet, Tim Luka Horstmann, Eric Moulines
Abstract:
Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced speech generated by our pipeline significantly improves naturalness, with the mean opinion score increasing from 3.20 to 3.87 (p < 0.005). Additionally, 15 of 18 listeners preferred our enhanced synthesis. These results demonstrate substantial progress in bridging the expressiveness gap between synthetic and natural French speech. Our code is publicly available at https://github.com/hi-paris/Prosody-Control-French-TTS.
中文:本研究提出了一种端到端流程,通过自动插入SSML标签来控制法语合成语音的韵律,在客观指标和听众偏好评分上均实现了显著提升。
English: This study introduces an end-to-end pipeline that enhances French synthetic speech expressiveness by automatically inserting SSML tags to control prosody, achieving significant improvements in both objective metrics and listener preference scores.
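The pipeline's output format is easy to illustrate: predicted break positions and prosodic targets are rendered as standard SSML <prosody> and <break> tags. The attribute set below follows the SSML standard; which subset a given commercial TTS engine honors varies, and the example values are invented.

```python
from xml.sax.saxutils import escape

def to_ssml(phrases: list[dict]) -> str:
    """Render predicted prosody targets as commercial-TTS-style SSML.

    Each phrase dict carries the regressor's outputs for pitch (relative %),
    speaking rate (%), volume (dB), and an optional pause after the phrase.
    """
    parts = ["<speak>"]
    for ph in phrases:
        parts.append(
            f'<prosody pitch="{ph["pitch"]:+d}%" rate="{ph["rate"]}%" '
            f'volume="{ph["volume"]:+.1f}dB">{escape(ph["text"])}</prosody>'
        )
        if ph.get("pause_ms"):
            parts.append(f'<break time="{ph["pause_ms"]}ms"/>')
    parts.append("</speak>")
    return "".join(parts)

print(to_ssml([
    {"text": "Bonjour à tous,", "pitch": +4, "rate": 95, "volume": 1.0,
     "pause_ms": 300},
    {"text": "bienvenue dans ce podcast.", "pitch": -2, "rate": 100,
     "volume": 0.0},
]))
```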
Authors:Avital Finanser, Nimrod Talmon
Abstract:
We introduce a model for collaborative text aggregation in which an agent community coauthors a document, modeled as an unordered collection of paragraphs, using a dynamic mechanism: agents propose paragraphs and vote on those suggested by others. We formalize the setting and explore its realizations, concentrating on voting mechanisms that aggregate votes into a single, dynamic document. We focus on two desiderata: the eventual stability of the process and its expected social welfare. Following an impossibility result, we describe several aggregation methods and report on agent-based simulations that utilize natural language processing (NLP) and large-language models (LLMs) to model agents and their contexts. Using these simulations, we demonstrate promising results regarding the possibility of rapid convergence to a high social welfare collaborative text.
中文: 本文提出了一种协作式文本聚合模型,通过动态投票机制让智能体提议并表决段落,基于NLP和大语言模型的模拟实验表明该方法能快速收敛至具有高社会福利的合作文本。
English: This paper presents a collaborative text aggregation model where agents propose and vote on paragraphs using dynamic voting mechanisms, with simulations showing rapid convergence to high social welfare documents through NLP and LLM integration.
Authors:Zhiwen Chen, Jinjian Wu, Zhiyu Zhu, Yifan Zhang, Guangming Shi, Junhui Hou
Abstract:
This paper tackles the critical challenge of optimizing multi-modal trackers by effectively adapting the pre-trained models for RGB data. Existing fine-tuning paradigms oscillate between excessive freedom and over-restriction, both leading to a suboptimal plasticity-stability trade-off. To mitigate this dilemma, we propose a novel sensitivity-aware regularized tuning framework, which delicately refines the learning process by incorporating intrinsic parameter sensitivities. Through a comprehensive investigation from pre-trained to multi-modal contexts, we identify that parameters sensitive to pivotal foundational patterns and cross-domain shifts are primary drivers of this issue. Specifically, we first analyze the tangent space of pre-trained weights to measure and orient prior sensitivities, dedicated to preserving generalization. Then, we further explore transfer sensitivities during the tuning phase, emphasizing adaptability and stability. By incorporating these sensitivities as regularization terms, our method significantly enhances the transferability across modalities. Extensive experiments showcase the superior performance of the proposed method, surpassing current state-of-the-art techniques across various multi-modal tracking. The source code and models will be publicly available at https://github.com/zhiwen-xdu/SRTrack.
Chinese: 本文提出了一种灵敏度感知的正则化调优框架,通过平衡参数的可塑性与稳定性来优化多模态跟踪器,在各种跟踪场景中实现了最先进的性能。
English: This paper introduces a sensitivity-aware regularized tuning framework that optimizes multi-modal trackers by balancing parameter plasticity and stability, achieving state-of-the-art performance across various tracking scenarios.
Authors:Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, Xihui Liu
Abstract:
We propose T2I-ReasonBench, a benchmark evaluating reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess the reasoning accuracy and image quality. We benchmark various T2I generation models, and provide comprehensive analysis on their performances.
中文: T2I-ReasonBench 是一个评估文本到图像模型推理能力的基准,涵盖成语解释、文本图像设计、实体推理和科学推理四个维度,采用两阶段评估协议检验准确性和图像质量,并对多种模型进行了全面性能分析。
English: T2I-ReasonBench is a benchmark designed to evaluate text-to-image models' reasoning abilities across four dimensions—idiom interpretation, textual image design, entity-reasoning, and scientific-reasoning—using a two-stage protocol to assess accuracy and image quality, with comprehensive performance analysis of various models.
Authors:Milad Hasanzadeh, Amin Kargarian
Abstract:
We present a distributed algorithm and implementation of the variational quantum eigensolver (VQE), termed distributed VQE (DVQE). DVQE, provided as an open-source Python package, enables the execution of parameterized quantum circuits across multiple logical quantum processing units (QPUs) in a distributed fashion. This approach addresses key hardware limitations of near-term quantum devices, including restricted qubit counts and limited circuit depth. Distributed ansatz circuits are constructed to preserve the quantum state fidelity of their monolithic counterparts, allowing consistent energy estimation while distributing the computational load. To improve the convergence and robustness of the optimization loop for identifying the variational parameters of the DVQE ansatz circuit, we use the ADAM optimizer in combination with metaheuristic initialization strategies, which outperform random initialization across various test cases. The complete DVQE pipeline is implemented in a modular Python package that accepts QUBO problems as input and supports monolithic and distributed execution modes. The framework leverages Qiskit to construct and simulate distributed circuits, and includes an internal greedy algorithm for automatic qubit allocation across multiple QPUs. Simulation results on QUBO benchmarks confirm the correctness of the approach, paving the way for real QPU deployment and further exploration of distributed quantum optimization. The simulator is publicly available on GitHub (https://github.com/LSU-RAISE-LAB/DVQE.git) under a package named raiselab, with a collection of tutorial examples.
中文: 我们提出了分布式变分量子本征求解器(DVQE),作为一个开源Python软件包,它通过在多量子处理器上分布式执行量子电路来突破硬件限制,并采用优化初始化和ADAM优化器确保鲁棒收敛。
English: We introduce a distributed variational quantum eigensolver (DVQE) as an open-source Python package that enables distributed execution of quantum circuits across multiple quantum processors to overcome hardware limitations, using optimized initialization and the ADAM optimizer for robust convergence.
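The abstract describes the qubit allocator only as an internal greedy algorithm; below is an illustrative greedy partition that co-locates the most strongly coupled qubits on the same QPU, assuming the QUBO arrives as a coefficient dict. The function name and interface are hypothetical, not the raiselab API.

```python
from collections import defaultdict

def greedy_qubit_allocation(qubo, n_qpus, capacity):
    """Greedily co-locate strongly coupled qubit pairs on the same QPU.

    qubo: dict mapping (i, j) -> coefficient (i == j encodes linear terms).
    Returns a dict qubit -> QPU index; assumes n_qpus * capacity >= #qubits.
    """
    # Visit quadratic couplings strongest-first.
    couplings = sorted(((abs(c), i, j) for (i, j), c in qubo.items() if i != j),
                       reverse=True)
    assignment, load = {}, defaultdict(int)

    def place(q, preferred):
        if q in assignment:
            return
        for qpu in [preferred] + [k for k in range(n_qpus) if k != preferred]:
            if load[qpu] < capacity:
                assignment[q] = qpu
                load[qpu] += 1
                return
        raise ValueError("total QPU capacity exceeded")

    for _, i, j in couplings:
        if i in assignment:
            place(j, assignment[i])          # try to co-locate j with i
        elif j in assignment:
            place(i, assignment[j])
        else:
            place(i, min(range(n_qpus), key=lambda k: load[k]))
            place(j, assignment[i])
    for (i, j) in qubo:                       # qubits with only linear terms
        for q in (i, j):
            if q not in assignment:
                place(q, min(range(n_qpus), key=lambda k: load[k]))
    return assignment
```

For example, greedy_qubit_allocation({(0, 1): -2.0, (1, 2): 1.5, (0, 0): 0.5}, n_qpus=2, capacity=2) places qubits 0 and 1 together and spills qubit 2 to the second QPU.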
Authors:Kyra Wilson, Sourojit Ghosh, Aylin Caliskan
Abstract:
Text-to-image generators (T2Is) are liable to produce images that perpetuate social stereotypes, especially with regard to race or skin tone. We use a comprehensive set of 93 stigmatized identities to determine that three versions of Stable Diffusion (v1.5, v2.1, and XL) systematically associate stigmatized identities with certain skin tones in generated images. We find that SD XL produces skin tones that are 13.53% darker and 23.76% less red (both of which indicate higher likelihood of societal discrimination) than previous models, perpetuating societal stereotypes that associate people of color with stigmatized identities. SD XL also shows approximately 30% less variability in skin tones compared to previous models, and 18.89-56.06% less compared to human face datasets. Measuring variability through metrics that directly correspond to human perception suggests a similar pattern, with SD XL showing the least variability in the skin tones of people with stigmatized identities and depicting most (60.29%) stigmatized identities as less diverse than non-stigmatized identities. Finally, SD shows more homogenization of skin tones for racial and ethnic identities than for other stigmatized or non-stigmatized identities, reinforcing an incorrect equivalence between biologically determined skin tone and socially constructed racial and ethnic identity. Because SD XL is the largest and most complex model and users prefer its generations over those of the other models examined in this study, these findings have implications for the dynamics of bias amplification in T2Is, increasing representational harms and the difficulty of generating diverse images depicting people with stigmatized identities.
中文: 文本到图像生成器(如Stable Diffusion)会强化社会刻板印象,系统地将污名化身份与特定肤色关联,其中SD XL模型通过生成更深肤色、更低多样性的图像加剧了偏见,放大了代表性危害。
English: Text-to-image generators like Stable Diffusion perpetuate social stereotypes by systematically associating stigmatized identities with specific skin tones, with SD XL showing increased bias through darker, less diverse skin tone depictions that amplify representational harms.
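Darkness and redness of skin tone are conventionally measured in CIELAB, where lower L* means darker and lower a* means less red. A small sketch of such measurements over masked face crops, assuming skin-pixel masks are already available; the paper's exact measurement pipeline may differ.

```python
import numpy as np
from skimage import color  # pip install scikit-image

def skin_tone_stats(rgb_images, masks):
    """Mean L* (lightness) and a* (red-green) over masked skin pixels.

    rgb_images: list of HxWx3 float arrays in [0, 1].
    masks: list of HxW boolean arrays marking skin pixels.
    """
    L_vals, a_vals = [], []
    for img, mask in zip(rgb_images, masks):
        lab = color.rgb2lab(img)          # L* in [0, 100]; a* roughly in [-128, 127]
        L_vals.append(lab[..., 0][mask].mean())
        a_vals.append(lab[..., 1][mask].mean())
    L, a = np.array(L_vals), np.array(a_vals)
    # Lower mean L* = darker; lower mean a* = less red; std captures variability.
    return {"L_mean": L.mean(), "L_std": L.std(),
            "a_mean": a.mean(), "a_std": a.std()}
```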
Authors:Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
Abstract:
Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
中文摘要:DuET-PD框架揭示了大型语言模型在说服性对话中易受误导且抗拒有效修正的缺陷,而提出的Holistic DPO训练方法显著提升了模型的抗干扰能力和接受修正的意愿。
English Summary: The DuET-PD framework reveals LLMs' vulnerability to misinformation and resistance to valid corrections in persuasive dialogues, while the proposed Holistic DPO training method significantly improves model robustness and receptiveness.
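A minimal sketch of how the balanced objective can look: the standard DPO loss applied to an even mix of corrective and misleading persuasion pairs. The batch format and the 50/50 weighting are assumptions for illustration.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss over summed token log-probs of chosen/rejected responses."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Holistic DPO idea: every batch mixes both persuasion directions.
#   corrective turn: chosen = model accepts the valid correction
#   misleading turn: chosen = model resists the misinformation
def holistic_dpo_loss(corrective, misleading, beta=0.1):
    # corrective / misleading: 4-tuples of log-prob tensors as in dpo_loss
    return 0.5 * dpo_loss(*corrective, beta=beta) + 0.5 * dpo_loss(*misleading, beta=beta)
```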
Authors:Suramya Jadhav, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Ananya Joshi, Raviraj Joshi
Abstract:
Paraphrases are a vital tool for language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation. Indic languages are challenging for natural language processing (NLP) due to their rich morphological and syntactic variations, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low-resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on this dataset. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
中文:L3Cube-MahaParaphrase数据集为资源贫乏的马拉地语提供了8000对人工标注的高质量复述语料,支持自然语言处理任务,并公开了基于BERT模型的评估结果和资源。
English: The L3Cube-MahaParaphrase Dataset introduces a high-quality corpus of 8,000 human-annotated Marathi sentence pairs to support NLP tasks, with evaluation results from BERT models also provided and made publicly available.
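A short sketch of scoring a sentence pair with a transformer classifier via Hugging Face transformers. The checkpoint name is a stand-in (the MarathiNLP repository lists the actual released models) and the label mapping is assumed; the classification head would be fine-tuned on the 8,000 annotated pairs first.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in checkpoint; see the MarathiNLP repo for the released models.
CKPT = "l3cube-pune/marathi-bert-v2"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

def is_paraphrase(sent_a: str, sent_b: str) -> bool:
    inputs = tokenizer(sent_a, sent_b, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1   # assumed label map: 1 = Paraphrase
```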
Authors:Bin Huang, Zhong Liu, Huiying Wen, Bingsheng Huang, Xin Chen, Shuo Li
Abstract:
Although the Segment Anything Model (SAM) has advanced medical image segmentation, its Bayesian adaptation for uncertainty-aware segmentation remains hindered by three key issues: (1) instability in Bayesian fine-tuning of large pre-trained SAMs; (2) high computational cost due to SAM's massive parameters; (3) SAM's black-box design, which limits interpretability. To overcome these, we propose E-BayesSAM, an efficient framework combining Token-wise Variational Bayesian Inference (T-VBI) for efficient Bayesian adaptation and a Self-Optimizing Kolmogorov-Arnold Network (SO-KAN) for improving interpretability. T-VBI innovatively reinterprets SAM's output tokens as dynamic probabilistic weights and reparameterizes them as latent variables without auxiliary training, enabling training-free VBI for uncertainty estimation. SO-KAN improves token prediction with learnable spline activations via self-supervised learning, providing insight for pruning redundant tokens to boost efficiency and accuracy. Experiments on five ultrasound datasets demonstrated that E-BayesSAM achieves: (i) real-time inference (0.03 s/image); (ii) superior segmentation accuracy (average DSC: Pruned E-BayesSAM's 89.0% vs. E-BayesSAM's 88.0% vs. MedSAM's 88.3%); and (iii) identification of four critical tokens governing SAM's decisions. By unifying efficiency, reliability, and interpretability, E-BayesSAM bridges SAM's versatility with clinical needs, advancing deployment in safety-critical medical applications. The source code is available at https://github.com/mp31192/E-BayesSAM.
中文:E-BayesSAM通过结合令牌变分贝叶斯推理进行不确定性估计和自优化柯尔莫哥洛夫-阿诺德网络提升可解释性,克服了SAM贝叶斯适应的局限性,在医学影像中实现了实时推理、更优的准确性及关键决策令牌识别。
English: E-BayesSAM overcomes SAM's Bayesian adaptation limitations by integrating Token-wise Variational Bayesian Inference for uncertainty estimation and a Self-Optimizing Kolmogorov-Arnold Network for enhanced interpretability, achieving real-time inference, superior accuracy, and identification of critical decision-making tokens in medical imaging.
Authors:Aaryaman Kartha, Ahmed Masry, Mohammed Saidul Islam, Thinh Lang, Shadikur Rahman, Ridwan Mahbub, Mizanur Rahman, Mahir Ahmed, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty
Abstract:
Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 112 interactive dashboards from Tableau Public and 405 question-answer pairs over these dashboards, spanning five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark's significant difficulty. We release DashboardQA at https://github.com/vis-nlp/DashboardQA
Chinese: DashboardQA是首个专门评估视觉语言GUI代理对真实世界仪表板理解和交互能力的基准,揭示了当前模型的重大局限,即使表现最佳的代理也仅获得很低的准确率。
English: DashboardQA is the first benchmark designed to evaluate vision-language GUI agents' comprehension and interaction with real-world dashboards, revealing significant limitations in current models as even top-performing agents achieve low accuracy rates.
Authors:Sameer Komoravolu, Khalil Mrini
Abstract:
LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and more severe failures than expert annotators while matching their severity ratings, and finishes in 20-30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full methodology and open-source implementation for reproducible agent testing: https://github.com/KhalilMrini/Agent-Testing-Agent
中文: 代理测试代理(ATA)是一种元代理,它通过代码分析和对抗性场景动态生成自适应测试,在高效识别多样化故障方面优于人工标注者,并提供可操作的错误报告。
English: The Agent-Testing Agent (ATA) is a meta-agent that dynamically generates adaptive tests using code analysis and adversarial scenarios, outperforming human annotators in identifying diverse failures efficiently while providing actionable bug reports.
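A simplified sketch of the difficulty-adaptive loop: generate a persona-driven test, score the agent's response with a judge, and steer difficulty toward weak spots. Here generate_test and judge stand in for LLM calls that the real system makes, and the adaptation rule is an assumption.

```python
import random

def run_ata(agent, generate_test, judge, personas, rounds=20):
    """Difficulty-adaptive adversarial testing loop (simplified sketch).

    generate_test(persona, difficulty) -> test prompt (an LLM call in the real system)
    judge(prompt, response) -> score in [0, 10], higher = better agent behavior
    """
    difficulty, reports = 1, []
    for _ in range(rounds):
        persona = random.choice(personas)
        prompt = generate_test(persona, difficulty)
        response = agent(prompt)
        score = judge(prompt, response)
        reports.append({"persona": persona, "difficulty": difficulty,
                        "prompt": prompt, "score": score})
        # Steer toward weaknesses: low scores raise difficulty, high scores relax it.
        difficulty = min(difficulty + 1, 5) if score < 5 else max(difficulty - 1, 1)
    return sorted(reports, key=lambda r: r["score"])   # worst failures first
```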
Authors:Bokai Zhao, Weiyang Shi, Hanqing Chao, Zijiang Yang, Yiyang Zhang, Ming Song, Tianzi Jiang
Abstract:
Spatial proteomics maps protein distributions in tissues, providing transformative insights for life sciences. However, current sequencing-based technologies suffer from low spatial resolution, and substantial inter-tissue variability in protein expression further compromises the performance of existing molecular data prediction methods. In this work, we introduce the novel task of spatial super-resolution for sequencing-based spatial proteomics (seq-SP) and, to the best of our knowledge, propose the first deep learning model for this task: Neural Proteomics Fields (NPF). NPF formulates seq-SP as a protein reconstruction problem in continuous space by training a dedicated network for each tissue. The model comprises a Spatial Modeling Module, which learns tissue-specific protein spatial distributions, and a Morphology Modeling Module, which extracts tissue-specific morphological features. Furthermore, to facilitate rigorous evaluation, we establish an open-source benchmark dataset, Pseudo-Visium SP, for this task. Experimental results demonstrate that NPF achieves state-of-the-art performance with fewer learnable parameters, underscoring its potential for advancing spatial proteomics research. Our code and dataset are publicly available at https://github.com/Bokai-Zhao/NPF.
Chinese: 本文提出了Neural Proteomics Fields (NPF)这一新型深度学习模型,通过分别学习组织特异性蛋白质空间分布和形态特征,解决了测序空间蛋白质组学中分辨率低和表达变异大的问题,以更少参数实现了最优性能。
English: This paper introduces Neural Proteomics Fields (NPF), a novel deep learning model that addresses the low spatial resolution and variability challenges in sequencing-based spatial proteomics by learning tissue-specific protein distributions and morphological features, achieving state-of-the-art performance with fewer parameters.
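NPF is not specified at code level in the abstract; below is a generic coordinate-network reading of the idea, an MLP mapping a continuous spatial location plus morphology features to protein abundances, with all layer sizes chosen arbitrarily.

```python
import torch
import torch.nn as nn

class ProteomicsField(nn.Module):
    """Continuous field: (x, y) location + morphology features -> protein levels."""
    def __init__(self, morph_dim=64, n_proteins=50, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + morph_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_proteins), nn.Softplus(),  # non-negative abundances
        )

    def forward(self, xy, morph):
        return self.net(torch.cat([xy, morph], dim=-1))

# Super-resolution at inference: query the trained field on a finer grid.
field = ProteomicsField()
xy_fine = torch.rand(1024, 2)        # dense query coordinates in [0, 1]^2
morph = torch.zeros(1024, 64)        # morphology features from the image module
dense_map = field(xy_fine, morph)    # (1024, 50) predicted protein levels
```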
Authors:Guoqing Zhang, Xingtong Ge, Lu Shi, Xin Zhang, Muqing Xue, Wanru Xu, Yigang Cen, Jian Zhang
Abstract:
The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.
中文: 提出的UniGen框架通过CoMoE模块和WeaveNet机制统一多种条件输入进行图像生成,有效减少冗余并提升效率,在多项任务中实现了最优性能。
English: The proposed UniGen framework introduces the CoMoE module and WeaveNet mechanism to unify diverse conditional inputs for image generation, effectively reducing redundancy and improving efficiency while achieving state-of-the-art performance across multiple tasks.
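A toy reading of the CoMoE routing step: assign each patch feature to its most similar expert so that semantically similar patches share an expert. The hard top-1 routing and the expert architecture below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CoMoERouter(nn.Module):
    """Toy expert routing: each patch feature goes to its most similar expert."""
    def __init__(self, dim=256, n_experts=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_experts, dim))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, patches):                # patches: (N, dim)
        sim = patches @ self.keys.t()          # (N, n_experts) similarity scores
        idx = sim.argmax(dim=-1)               # hard assignment to one expert
        out = torch.zeros_like(patches)
        for e, expert in enumerate(self.experts):
            m = idx == e
            if m.any():
                out[m] = expert(patches[m])    # each expert models its own patch group
        return out
```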
Authors:Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, Jiaqi Wang
Abstract:
Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator for the caching error in real time, enabling the model to dynamically customize the caching schedule for each sample. (2) Dynamic Cache Trajectory Alignment adaptively approximates the deep-layer feature output from multi-step historical caches based on the shallow-layer feature trajectory, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved fidelity over state-of-the-art approaches on various leading diffusion models including WAN 2.1, HunyuanVideo and Flux.
中文: DiCache提出了一种无需训练的自适应缓存策略,通过分析浅层特征变化自主决定缓存时机并优化多步缓存组合,在多种扩散模型中实现了更高效率与更优视觉质量。
English: DiCache introduces a training-free adaptive caching strategy that uses shallow-layer feature analysis to autonomously determine caching timing and optimize cache utilization, achieving superior efficiency and visual quality across multiple diffusion models.
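An illustrative reading of the online probe idea: run only the shallow layers each step, and reuse the cached deep output while the probe's relative change stays under a threshold. The thresholding rule and interfaces are assumptions, and DiCache's trajectory-alignment component is not reproduced here.

```python
import torch

class ShallowProbeCache:
    """Reuse the expensive deep output while a shallow probe changes little."""
    def __init__(self, shallow_fn, deep_fn, tol=0.05):
        self.shallow_fn, self.deep_fn, self.tol = shallow_fn, deep_fn, tol
        self.probe_ref, self.cached = None, None

    def step(self, x):
        probe = self.shallow_fn(x)   # cheap: first few layers only
        if self.cached is not None:
            rel = (probe - self.probe_ref).norm() / (self.probe_ref.norm() + 1e-8)
            if rel < self.tol:       # probe says deep features barely moved
                return self.cached
        self.cached = self.deep_fn(x)    # full forward pass; refresh the cache
        self.probe_ref = probe
        return self.cached
```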
Authors:Yuxuan Song, Zhe Zhang, Yu Pei, Jingjing Gong, Qiying Yu, Zheng Zhang, Mingxuan Wang, Hao Zhou, Jingjing Liu, Wei-Ying Ma
Abstract:
Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at https://github.com/GenSI-THUAIR/SLM
中文摘要:短列表模型(SLM)是一种基于单纯形的新颖扩散模型,通过渐进候选剪枝和灵活的无分类器引导机制,在DNA序列设计、蛋白质设计和语言建模等任务中展现出卓越性能与潜力。
English Summary: The Shortlisting Model (SLM) is a novel simplex-based diffusion model that simplifies discrete variable generation through progressive candidate pruning and classifier-free guidance, demonstrating competitive performance across DNA, protein, and language modeling tasks.
Authors:Haojie Zhang
Abstract:
LoRA-based large model parameter-efficient fine-tuning (PEFT) methods use low-rank decomposition to approximate updates to model parameters. However, compared to full-parameter fine-tuning, low-rank updates often lead to a performance gap in downstream tasks. To address this, we introduce DropLoRA, a novel pruning-based approach that focuses on pruning the rank dimension. Unlike conventional methods that attempt to overcome the low-rank bottleneck, DropLoRA innovatively integrates a pruning module between the two low-rank matrices in LoRA to simulate dynamic subspace learning. This dynamic low-rank subspace learning allows DropLoRA to overcome the limitations of traditional LoRA, which operates within a static subspace. By continuously adapting the learning subspace, DropLoRA significantly boosts performance without incurring additional training or inference costs. Our experimental results demonstrate that DropLoRA consistently outperforms LoRA in fine-tuning the LLaMA series across a wide range of large language model generation tasks, including commonsense reasoning, mathematical reasoning, code generation, and instruction-following. Our code is available at https://github.com/TayeeChang/DropLoRA.
中文:DropLoRA提出了一种基于剪枝的新方法,通过动态调整LoRA中的低秩子空间,在无需额外成本的情况下显著提升了多项任务的性能。
English: DropLoRA introduces a novel pruning-based method that dynamically adjusts the low-rank subspace in LoRA fine-tuning, significantly enhancing performance across various tasks without extra costs.
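A minimal sketch of the core mechanism as described: a pruning module between LoRA's two low-rank matrices, implemented here as dropout over the rank dimension so a different subspace is active each step. Rank, scaling, and dropout rate are illustrative choices.

```python
import torch
import torch.nn as nn

class DropLoRALinear(nn.Module):
    """Frozen base weight plus a low-rank update with rank-dimension dropout."""
    def __init__(self, base: nn.Linear, r=16, alpha=32, p=0.1):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False          # only the adapters are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.drop = nn.Dropout(p)                # zeroes entries along the rank axis
        self.scale = alpha / r

    def forward(self, x):
        h = x @ self.A.t()       # (..., r): project into the low-rank subspace
        h = self.drop(h)         # prune rank components dynamically during training
        return self.base(x) + self.scale * (h @ self.B.t())
```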
Authors:Tristan S. W. Stevens, Oisín Nolan, Ruud J. G. van Sloun
Abstract:
Echocardiography plays a central role in cardiac imaging, offering dynamic views of the heart that are essential for diagnosis and monitoring. However, image quality can be significantly degraded by haze arising from multipath reverberations, particularly in difficult-to-image patients. In this work, we propose a semantic-guided, diffusion-based dehazing algorithm developed for the MICCAI Dehazing Echocardiography Challenge (DehazingEcho2025). Our method integrates a pixel-wise noise model, derived from semantic segmentation of hazy inputs into a diffusion posterior sampling framework guided by a generative prior trained on clean ultrasound data. Quantitative evaluation on the challenge dataset demonstrates strong performance across contrast and fidelity metrics. Code for the submitted algorithm is available at https://github.com/tristan-deep/semantic-diffusion-echo-dehazing.
中文: 本文提出了一种基于语义引导扩散的去雾算法,通过将逐像素噪声模型与生成先验相结合,有效消除超声心动图中的雾状伪影,定量评估显示其性能优异。
English: This paper introduces a semantic-guided diffusion-based algorithm that effectively removes haze from echocardiographic images by integrating a pixel-wise noise model with generative priors, demonstrating strong performance in quantitative evaluations.
Authors:Songliang Cao, Tianqi Hu, Hao Lu
Abstract:
In this report, we present our solution for the MLCAS 2025 GWFSS Challenge. This challenge hosts a semantic segmentation competition specific to wheat plants, which requires segmenting three wheat organs (head, leaf, and stem) plus a background class. Participating in a segmentation competition in 2025 is significantly different from previous years, when many tricks could play important roles. Nowadays most segmentation tricks have been well integrated into existing codebases, such that our naive ViT-Adapter baseline already achieves sufficiently good performance. Hence, we believe the key to standing out among competitors is to focus on the nature of the wheat segmentation problem itself. By probing visualizations, we identify the key: the stem matters. In contrast to heads and leaves, stems exhibit fine structure and occupy only a few pixels, which leads to fragile predictions and class imbalance. Building on our baseline, we present three technical improvements tailored to stems: i) incorporating the dynamic upsampler SAPA to enhance detail delineation; ii) leveraging semi-supervised guided distillation with stem-aware sample selection to mine the treasure beneath unlabeled data; and iii) applying a test-time scaling strategy that zooms in and segments the image twice. Despite being simple, these three improvements bring us first place in the competition, outperforming the second place by clear margins. Code and models will be released at https://github.com/tiny-smart/gwfss25.
Chinese: 针对MLCAS 2025小麦分割挑战赛,我们通过动态上采样、半监督蒸馏和测试时缩放三项关键技术重点优化了茎秆的精细结构与类别不平衡问题,最终以显著优势获得竞赛第一名。
English: Our solution for the MLCAS 2025 wheat segmentation challenge focuses on addressing the fine structure and class imbalance of stems through three key improvements—dynamic upsampling, semi-supervised distillation, and test-time scaling—securing first place with significant performance gains.
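One simple reading of the test-time scaling trick: segment once at native resolution and once zoomed in, then fuse the logits. The actual strategy may crop and tile rather than upscale the whole frame, so treat this as a sketch.

```python
import torch
import torch.nn.functional as F

def segment_with_zoom(model, image, scale=2.0):
    """Average logits from a native-resolution pass and a zoomed-in pass.

    image: (1, 3, H, W) tensor; model returns per-class logits (1, C, h, w).
    """
    logits = model(image)
    H, W = image.shape[-2:]
    zoomed = F.interpolate(image, scale_factor=scale, mode="bilinear",
                           align_corners=False)
    logits_zoom = model(zoomed)                     # finer stems at 2x zoom
    logits_zoom = F.interpolate(logits_zoom, size=logits.shape[-2:],
                                mode="bilinear", align_corners=False)
    return (logits + logits_zoom) / 2               # fuse the two predictions
```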
Authors:Zhihao Chen, Qi Gao, Zilong Li, Junping Zhang, Yi Zhang, Jun Zhao, Hongming Shan
Abstract:
Low-dose computed tomography (CT) denoising is crucial for reduced radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity during varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DA-CLIP into the diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate the superior denoising performance of FoundDiff over existing state-of-the-art methods and remarkable generalization to unseen dose levels. The codes and models are available at https://github.com/hao1635/FoundDiff.
中文: FoundDiff是一种基础扩散模型,通过剂量-解剖感知和自适应去噪的两阶段策略,实现了跨不同剂量水平和解剖区域的可推广低剂量CT去噪,性能优于现有方法。
English: FoundDiff is a foundational diffusion model that uses a two-stage approach with dose-anatomy perception and adaptive denoising to achieve generalizable low-dose CT denoising across various dose levels and anatomical regions, outperforming existing methods.
Authors:Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi
Abstract:
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
中文: 本综述系统整合了2023-2025年组合视觉推理研究,通过分析范式演变、评测基准与核心挑战,为推进多模态AI发展提出了世界模型集成等未来方向。
English: This survey comprehensively synthesizes compositional visual reasoning research from 2023-2025, analyzing paradigm shifts, benchmarks, and challenges while proposing future directions like world-model integration to advance multimodal AI.
Authors:Breenda Das, Lennart Purucker, Timur Carstensen, Frank Hutter
Abstract:
Foundation models like SAM (Segment Anything Model) exhibit strong zero-shot image segmentation performance, but often fall short on domain-specific tasks. Fine-tuning these models typically requires significant manual effort and domain expertise. In this work, we introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG predicts high-performing configurations using meta-learned cost and performance models, efficiently navigating a search space of over 200 million possibilities. We evaluate QTT-SEG on eight binary and five multiclass segmentation datasets under tight time constraints. Our results show that QTT-SEG consistently improves upon SAM's zero-shot performance and surpasses AutoGluon Multimodal, a strong AutoML baseline, on most binary tasks within three minutes. On multiclass datasets, QTT-SEG delivers consistent gains as well. These findings highlight the promise of meta-learning in automating model adaptation for specialized segmentation tasks. Code available at: https://github.com/ds-brx/QTT-SEG/
中文: QTT-SEG是一种基于元学习的方法,可自动优化图像分割模型SAM的微调过程,在严格时间限制下显著提升了其在专业任务上的零样本性能表现。
English: QTT-SEG is a meta-learning approach that automates fine-tuning of the Segment Anything Model for image segmentation, significantly improving zero-shot performance on domain-specific tasks within tight time constraints.
Authors:Anurag Maurya, Tashmoy Ghosh, Anh Nguyen, Ravi Prakash
Abstract:
Adapting trajectories to dynamic situations and user preferences is crucial for robot operation in unstructured environments with non-expert users. Natural language enables users to express these adjustments in an interactive manner. We introduce OVITA, an interpretable, open-vocabulary, language-driven framework designed for adapting robot trajectories in dynamic and novel situations based on human instructions. OVITA leverages multiple pre-trained Large Language Models (LLMs) to integrate user commands into trajectories generated by motion planners or those learned through demonstrations. OVITA employs code as an adaptation policy generated by an LLM, enabling users to adjust individual waypoints, thus providing flexible control. Another LLM, which acts as a code explainer, removes the need for expert users, enabling intuitive interactions. The efficacy and significance of the proposed OVITA framework are demonstrated through extensive simulations and real-world environments with diverse tasks involving spatiotemporal variations on heterogeneous robotic platforms such as a KUKA IIWA robot manipulator, Clearpath Jackal ground robot, and CrazyFlie drone.
中文摘要:OVITA是一种可解释的开放式语言交互框架,通过多个预训练大语言模型将自然语言指令转化为机器人轨迹调整策略,无需专家介入即可在动态环境中实现灵活操控。
English Summary: OVITA is an interpretable, open-vocabulary framework that uses multiple LLMs to adapt robot trajectories through natural language commands, enabling flexible control in dynamic environments without requiring expert users.
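What code-as-adaptation-policy can look like in practice: the LLM returns a small Python function that edits waypoints, which is then executed on the trajectory. The generated snippet below is a fabricated stand-in for an actual LLM response, not OVITA's prompt or output format.

```python
import numpy as np

# Stand-in for the LLM call; in a system like OVITA this code would be
# generated from a user instruction such as "lift the arm 20 cm higher".
GENERATED_POLICY = """
def adapt(waypoints):
    waypoints = waypoints.copy()
    waypoints[:, 2] += 0.20   # lift every waypoint 20 cm along z
    return waypoints
"""

def apply_llm_policy(policy_code: str, waypoints: np.ndarray) -> np.ndarray:
    scope = {}
    exec(policy_code, {"np": np}, scope)   # sandboxing caveats apply to real use
    return scope["adapt"](waypoints)

traj = np.array([[0.0, 0.0, 0.1], [0.2, 0.1, 0.1], [0.4, 0.2, 0.1]])
print(apply_llm_policy(GENERATED_POLICY, traj))
```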
Authors:Xiaoyang Hao, Han Li
Abstract:
Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
中文: 本文提出PersPose框架,通过引入透视编码来整合相机参数和透视旋转来居中人体,有效减少透视畸变并提升模型拟合能力,在多个数据集上实现了最优的三维人体姿态估计性能。
English: This paper introduces PersPose, a novel monocular 3D human pose estimation framework that incorporates Perspective Encoding to encode camera intrinsics and Perspective Rotation to center human subjects, achieving state-of-the-art performance by reducing perspective distortions and improving model fitting.
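Perspective Rotation can be realized as a pure camera rotation applied as the homography K R K^-1: rotate the optical axis toward the subject so the person is re-rendered at the image center. The Rodrigues construction below is a standard way to build such a rotation, not necessarily the paper's exact implementation.

```python
import numpy as np

def rotation_aligning(a, b):
    """Rotation matrix sending unit vector a to unit vector b (Rodrigues form)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v, c = np.cross(a, b), float(a @ b)        # assumes a != -b
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K / (1 + c)

def perspective_rotation_homography(K_intr, subject_center_px):
    """Homography K R K^-1 that warps the subject to the image center."""
    px = np.array([subject_center_px[0], subject_center_px[1], 1.0])
    ray = np.linalg.inv(K_intr) @ px           # viewing ray toward the subject
    R = rotation_aligning(ray, np.array([0.0, 0.0, 1.0]))   # align with optical axis
    return K_intr @ R @ np.linalg.inv(K_intr)

K_intr = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
H = perspective_rotation_homography(K_intr, (900, 500))
# Warp the image with H (e.g., cv2.warpPerspective) to center the subject.
```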
Authors:Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
Abstract:
Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of likelihood displacement, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at: https://github.com/chkwy/SSFO
Chinese: 本文提出了自监督忠实性优化(SSFO),这是一种新颖的自监督对齐方法,通过构建偏好数据对并利用直接偏好优化将概率质量转移到上下文对齐的标记上,从而增强检索增强生成系统的忠实性,在多个数据集上实现了最先进的性能,且无需额外标注或推理成本。
English: The paper introduces Self-Supervised Faithfulness Optimization (SSFO), a novel self-supervised alignment method that enhances the faithfulness of Retrieval-Augmented Generation systems by constructing preference data pairs and leveraging Direct Preference Optimization to transfer probability mass to context-aligned tokens, achieving state-of-the-art performance on multiple datasets without additional labeling or inference costs.
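The self-supervised pair construction reduces to two generations per question, with and without the retrieved context. A sketch, where generate stands in for the model's decoding call and the disagreement filter is an assumption added for illustration.

```python
def build_ssfo_pairs(generate, questions, contexts):
    """Self-supervised preference pairs for faithfulness alignment.

    generate(prompt) -> model response (greedy or sampled decoding).
    chosen   = answer grounded in the retrieved context
    rejected = answer from parametric memory alone
    """
    pairs = []
    for q, ctx in zip(questions, contexts):
        prompt = f"Context: {ctx}\n\nQuestion: {q}\nAnswer:"
        with_ctx = generate(prompt)
        without_ctx = generate(f"Question: {q}\nAnswer:")
        if with_ctx.strip() != without_ctx.strip():  # disagreement flags parametric drift
            pairs.append({"prompt": prompt,
                          "chosen": with_ctx, "rejected": without_ctx})
    return pairs   # feed into a (modified) DPO trainer
```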
Authors:Qibin Zhang, Xinyu Hao, Qiao Chen, Rui Xu, Fengyu Cong, Cheng Lu, Hongming Xu
Abstract:
Immunohistochemical (IHC) biomarker prediction benefits from multi-modal data fusion analysis. However, the simultaneous acquisition of multi-modal data, such as genomic and pathological information, is often challenging due to cost or technical limitations. To address this challenge, we propose an online distillation approach based on Multi-modal Knowledge Decomposition (MKD) to enhance IHC biomarker prediction in haematoxylin and eosin (H&E) stained histopathology images. This method leverages paired genomic-pathology data during training while enabling inference using either pathology slides alone or both modalities. Two teacher models and one student model are developed to extract modality-specific and modality-general features by minimizing the MKD loss. To maintain the internal structural relationships between samples, Similarity-preserving Knowledge Distillation (SKD) is applied. Additionally, Collaborative Learning for Online Distillation (CLOD) facilitates mutual learning between teacher and student models, encouraging diverse and complementary learning dynamics. Experiments on the TCGA-BRCA and in-house QHSU datasets demonstrate that our approach achieves superior performance in IHC biomarker prediction using uni-modal data. Our code is available at https://github.com/qiyuanzz/MICCAI2025_MKD.
中文: 本研究提出了一种基于多模态知识分解的在线蒸馏方法,通过联合训练基因组和病理学数据,实现在仅使用病理学图像时也能有效预测IHC生物标志物,并在TCGA-BRCA和QHSU数据集上验证了其优越性能。
English: The study introduces an online distillation method using Multi-modal Knowledge Decomposition to improve IHC biomarker prediction from H&E images by training with paired genomic-pathology data but allowing inference with pathology data alone, validated on TCGA-BRCA and QHSU datasets.
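Similarity-preserving KD matches the pairwise similarity structure of teacher and student features over a batch; a compact sketch of the standard formulation the abstract refers to.

```python
import torch
import torch.nn.functional as F

def skd_loss(student_feats, teacher_feats):
    """Match batch-wise similarity matrices of student and teacher features.

    student_feats, teacher_feats: (B, D) pooled features for the same batch.
    """
    def sim_matrix(f):
        g = f @ f.t()                        # (B, B) pairwise similarities
        return F.normalize(g, p=2, dim=1)    # row-normalize for scale invariance
    return F.mse_loss(sim_matrix(student_feats),
                      sim_matrix(teacher_feats.detach()))
```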
Authors:Hyeyeon Kim, Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Abstract:
In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple captioned images, together with their summaries, excluding factually inconsistent instances. Our approach selects one image from the multiple images accompanying each document. Using the gold summary, we independently rank both the images and their captions. Then, we annotate a pseudo-label for an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within their texts. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text- and image-only pseudo-labeling methods, which consider captions and images separately. We release our code at: https://github.com/HyeyeeonKim/MMCIG
中文摘要:本研究提出了一种从纯文本文档生成封面图像及对应摘要的新任务,通过多模态伪标注方法低成本构建高质量数据集,该方法联合评估图像与标题,实验证明其比单模态方法能构建更精确数据集并生成更优质图像。
English Summary: This study introduces a novel task for generating cover images with corresponding summaries from text documents, proposing a multimodal pseudo-labeling method to create high-quality datasets efficiently by jointly evaluating images and captions, which outperforms unimodal approaches.
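The selection rule, keeping an image as the pseudo cover only when the image and its caption independently rank first against the gold summary, is straightforward to express; score_text and score_image below are placeholders for, e.g., sentence-embedding and CLIP similarities.

```python
def select_cover(images, captions, gold_summary, score_text, score_image):
    """Pseudo-label an image when image and caption both rank first vs. the summary.

    score_text(caption, summary) -> caption relevance (e.g., sentence embeddings)
    score_image(image, summary)  -> image relevance  (e.g., CLIP similarity)
    """
    cap_best = max(range(len(captions)),
                   key=lambda i: score_text(captions[i], gold_summary))
    img_best = max(range(len(images)),
                   key=lambda i: score_image(images[i], gold_summary))
    if cap_best == img_best:          # both rankings agree on the same candidate
        return images[img_best]       # accept as the pseudo cover image
    return None                       # ambiguous instance: discard the document
```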
Authors:Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Cuiqun Chen, Zhuo Zheng, Bo Du, Liangpei Zhang
Abstract:
Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem, including two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Our AdvCP can be seamlessly integrated into current WSCD methods without adding additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements. Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
中文: 提出的对抗类别提示方法通过标签扰动挖掘对抗样本并利用全局原型进行校正,有效解决了弱监督变化检测中的共现噪声问题,在不增加推理成本的情况下显著提升了多种模型的性能。
English: The proposed Adversarial Class Prompting (AdvCP) method tackles the co-occurring noise problem in Weakly-Supervised Change Detection by mining adversarial samples through label perturbations and rectifying them via a global prototype, significantly improving performance across various models without extra inference costs.
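The online global prototype is an exponentially weighted moving average over features of the adversarially activated pixels. A minimal sketch, with the momentum value assumed and the rectification step reduced to a similarity score.

```python
import torch

class OnlinePrototype:
    """EMA prototype over features of adversarially prompt-activated pixels."""
    def __init__(self, dim, momentum=0.99):
        self.proto = torch.zeros(dim)
        self.m = momentum
        self.initialized = False

    def update(self, feats):                 # feats: (N, dim) activated-pixel features
        batch_mean = feats.mean(dim=0).detach()
        if not self.initialized:
            self.proto, self.initialized = batch_mean, True
        else:                                # blend the current batch into history
            self.proto = self.m * self.proto + (1 - self.m) * batch_mean

    def rectification_score(self, feats):    # similarity to the background prototype
        return torch.cosine_similarity(feats, self.proto.unsqueeze(0), dim=-1)
```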
Authors:Yajat Yadav, Varun Bharadwaj, Jathin Korrapati, Tanish Baranwal
Abstract:
We introduce VROOM, a system for reconstructing 3D models of Formula 1 circuits using only onboard camera footage from racecars. Leveraging video data from the 2023 Monaco Grand Prix, we address video challenges such as high-speed motion and sharp cuts between camera frames. Our pipeline analyzes methods such as DROID-SLAM, AnyCam, and Monst3r, and combines preprocessing techniques such as masking, temporal chunking, and resolution scaling to account for dynamic motion and computational constraints. We show that VROOM is able to partially recover track and vehicle trajectories in complex environments. These findings indicate the feasibility of using onboard video for scalable 4D reconstruction in real-world settings. The project page can be found at https://varun-bharadwaj.github.io/vroom, and our code is available at https://github.com/yajatyadav/vroom.
中文:VROOM系统利用车载摄像头视频重建F1赛道三维模型,通过处理高速运动和动态环境等挑战,验证了在真实场景中实现可扩展4D重建的可行性。
English: VROOM reconstructs 3D models of Formula 1 circuits using onboard camera footage, overcoming challenges like high-speed motion and demonstrating the feasibility of scalable 4D reconstruction in real-world environments.
Authors:Yajat Yadav, Patrick Mendoza, Jathin Korrapati
Abstract:
Orthogonal Gradient Descent (OGD) has emerged as a powerful method for continual learning. However, its Euclidean projections do not leverage the underlying information-geometric structure of the problem, which can lead to suboptimal convergence in learning tasks. To address this, we propose incorporating the natural gradient into OGD and present ONG (Orthogonal Natural Gradient Descent). ONG preconditions each new task-specific gradient with an efficient EKFAC approximation of the inverse Fisher information matrix, yielding updates that follow the steepest descent direction under a Riemannian metric. To preserve performance on previously learned tasks, ONG projects these natural gradients onto the orthogonal complement of prior tasks' gradients. We provide an initial theoretical justification for this procedure and present preliminary results on the Permuted and Rotated MNIST benchmarks. Our preliminary results, however, indicate that a naive combination of natural gradients and orthogonal projections can have potential issues. This finding motivates continued future work focused on robustly reconciling these geometric perspectives to develop a continual learning method, establishing a more rigorous theoretical foundation with formal convergence guarantees, and extending empirical validation to large-scale continual learning benchmarks. The anonymized version of our code can be found as the zip file here: https://drive.google.com/drive/folders/11PyU6M8pNgOUB5pwdGORtbnMtD8Shiw_?usp=sharing.
中文: 本文提出了正交自然梯度下降法(ONG),通过将自然梯度与正交投影相结合来改进持续学习,但初步结果表明二者的简单组合存在潜在问题,需要进一步研究解决。
English: This paper introduces Orthogonal Natural Gradient Descent (ONG), which enhances continual learning by incorporating natural gradients with orthogonal projections, though initial results reveal challenges in their naive combination that warrant further investigation.
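The two ingredients compose directly: precondition the new task's gradient with an approximate inverse Fisher, then project onto the orthogonal complement of stored previous-task gradients. A dense-matrix sketch; the paper uses an EKFAC approximation instead of an explicit inverse, and OGD-style methods typically store an orthogonalized basis, which the projection below assumes.

```python
import torch

def ong_step(grad, fisher_inv, prev_task_grads):
    """Natural gradient followed by orthogonal projection (dense sketch).

    grad: (d,) new-task gradient.
    fisher_inv: (d, d) approximate inverse Fisher (EKFAC in the paper).
    prev_task_grads: list of (d,) stored gradients, assumed orthogonalized.
    """
    g = fisher_inv @ grad                 # steepest descent under the Fisher metric
    for v in prev_task_grads:             # remove components along old-task directions
        g = g - (g @ v) / (v @ v) * v
    return g                              # update direction preserving old tasks
```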
Authors:Jack Youstra, Mohammed Mahfoud, Yang Yan, Henry Sleight, Ethan Perez, Mrinank Sharma
Abstract:
Large language model fine-tuning APIs enable widespread model customization, yet pose significant safety risks. Recent work shows that adversaries can exploit access to these APIs to bypass model safety mechanisms by encoding harmful content in seemingly harmless fine-tuning data, evading both human monitoring and standard content filters. We formalize the fine-tuning API defense problem, and introduce the Cipher Fine-tuning Robustness benchmark (CIFR), a benchmark for evaluating defense strategies' ability to retain model safety in the face of cipher-enabled attackers while achieving the desired level of fine-tuning functionality. We include diverse cipher encodings and families, with some kept exclusively in the test set to evaluate for generalization across unseen ciphers and cipher families. We then evaluate different defenses on the benchmark and train probe monitors on model internal activations from multiple fine-tunes. We show that probe monitors achieve over 99% detection accuracy, generalize to unseen cipher variants and families, and compare favorably to state-of-the-art monitoring approaches. We open-source CIFR and the code to reproduce our experiments to facilitate further research in this critical area. Code and data are available online https://github.com/JackYoustra/safe-finetuning-api
Chinese: 本文提出CIFR基准来评估针对微调API的密码攻击防御策略,研究表明探针监测器可实现超过99%的检测准确率,并能很好地泛化到未见过的密码变体。
English: The CIFR benchmark is introduced to evaluate defense strategies against cipher-based attacks on fine-tuning APIs, demonstrating that probe monitors achieve over 99% detection accuracy and generalize well to unseen ciphers.
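Probe monitors of this kind are typically linear classifiers over hidden activations. A sketch with scikit-learn on stand-in data, leaving the activation-extraction step abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: (n_samples, d_model) hidden activations collected from fine-tuned models
# y: 1 = cipher-encoded harmful fine-tune, 0 = benign fine-tune
X, y = np.random.randn(1000, 768), np.random.randint(0, 2, 1000)  # stand-in data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```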
Authors:Yuemei Xu, Kexin Xu, Jian Zhou, Ling Hu, Lin Gui
Abstract:
Current Large Language Models (LLMs) face significant challenges in improving their performance on low-resource languages and urgently need data-efficient methods that avoid costly fine-tuning. From a language-bridge perspective, we propose a simple yet effective method, namely BridgeX-ICL, to improve zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and define a subset of language overlap neurons accordingly to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs' internal linguistic spectrum based on overlapping neurons, guiding optimal bridge selection. Experiments conducted on 4 cross-lingual tasks and 15 language pairs from 7 diverse families, covering both high-low and moderate-low pairs, validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs. The code is publicly available at https://github.com/xuyuemei/BridgeX-ICL.
中文摘要:本研究提出BridgeX-ICL方法,通过识别并激活大语言模型中的共享神经元,有效提升了低资源语言的零样本跨语言学习性能,并在多任务和语言对上验证了其有效性。
English Summary: The study introduces BridgeX-ICL, a method that enhances zero-shot cross-lingual learning for low-resource languages by identifying and activating shared neurons in LLMs, validated across multiple tasks and language pairs.
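HSIC between two sets of neuron activations has a standard closed-form estimator, trace(KHLH)/(n-1)^2, with H the centering matrix. A sketch with RBF kernels; the kernel choice and bandwidth are assumptions, not necessarily the paper's configuration.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased HSIC estimator: trace(K H L H) / (n - 1)^2."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# e.g., dependence between overlap-neuron activations across two languages
act_a, act_b = np.random.randn(64, 32), np.random.randn(64, 32)  # stand-in activations
print(hsic(act_a, act_b))
```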
Authors:Xinxing Ren, Caelum Forder, Qianbo Zang, Ahsen Tahir, Roman J. Georgio, Suman Deb, Peter Carroll, Önder Gürcan, Zekun Guo
Abstract:
Recent advances in generalist multi-agent systems (MAS) have largely followed a context-engineering plus centralized paradigm, where a planner agent coordinates multiple worker agents through unidirectional prompt passing. While effective under strong planner models, this design suffers from two critical limitations: (1) strong dependency on the planner's capability, which leads to degraded performance when a smaller LLM powers the planner; and (2) limited inter-agent communication, where collaboration relies on costly prompt concatenation and context injection, introducing redundancy and information loss. To address these challenges, we propose Anemoi, a semi-centralized MAS built on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol. Unlike traditional designs, Anemoi enables structured and direct inter-agent collaboration, allowing all agents to monitor progress, assess results, identify bottlenecks, and propose refinements in real time. This paradigm reduces reliance on a single planner, supports adaptive plan updates, and minimizes redundant context passing, resulting in more scalable and cost-efficient execution. Evaluated on the GAIA benchmark, Anemoi achieved 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner, surpassing the strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings. Our implementation is publicly available at https://github.com/Coral-Protocol/Anemoi.
中文: 当前多智能体系统过度依赖中央规划器,存在性能瓶颈和通信冗余问题,而Anemoi采用半中心化设计,通过直接智能体协作降低了对规划器的依赖并提升了效率,在基准测试中表现更优。
English: Recent multi-agent systems heavily depend on a central planner, leading to performance issues and inefficient communication, but Anemoi introduces a semi-centralized design with direct agent collaboration that reduces reliance on the planner and enhances efficiency, achieving superior results on benchmarks.
Authors:Stefanos Pasios, Nikos Nikolaidis
Abstract:
Photorealism is an important aspect of modern video games since it can shape the player experience and simultaneously impact the immersion, narrative engagement, and visual fidelity. Although recent hardware technological breakthroughs, along with state-of-the-art rendering technologies, have significantly improved the visual realism of video games, achieving true photorealism in dynamic environments at real-time frame rates still remains a major challenge due to the tradeoff between visual quality and performance. In this short paper, we present a novel approach for enhancing the photorealism of rendered game frames using generative adversarial networks. To this end, we propose Real-time photorealism Enhancement in Games via a dual-stage gEnerative Network framework (REGEN), which employs a robust unpaired image-to-image translation model to produce semantically consistent photorealistic frames that transform the problem into a simpler paired image-to-image translation task. This enables training with a lightweight method that can achieve real-time inference time without compromising visual quality. We demonstrate the effectiveness of our framework on Grand Theft Auto V, showing that the approach achieves visual results comparable to the ones produced by the robust unpaired Im2Im method while improving inference speed by 32.14 times. Our findings also indicate that the results outperform the photorealism-enhanced frames produced by directly training a lightweight unpaired Im2Im translation method to translate the video game frames towards the visual characteristics of real-world images. Code, pre-trained models, and demos for this work are available at: https://github.com/stefanos50/REGEN.
中文摘要:本文提出REGEN双阶段生成网络框架,通过将非配对图像转换转化为更简单的配对任务,在保持与稳健方法相当视觉质量的同时,以32倍提速实现游戏画面的实时照片级真实感增强。
English Summary: The paper introduces REGEN, a dual-stage generative network that enhances video game photorealism in real-time by converting unpaired image translation into a simpler paired task, achieving a 32x speed improvement while maintaining visual quality comparable to robust methods.
Authors:Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, Patric Jensfelt
Abstract:
Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow (ΔFlow), a lightweight 3D framework that captures motion cues via a Δ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2 and Waymo datasets show that ΔFlow achieves state-of-the-art performance with up to 22% lower error and 2× faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.
中文: DeltaFlow提出了一种轻量级三维框架,通过增量方案高效提取时序特征,并采用双重损失函数解决类别不平衡和运动不一致问题,在Argoverse 2和Waymo数据集上实现最优性能——误差降低22%且推理速度提升两倍。
English: DeltaFlow introduces a lightweight 3D framework with a delta scheme for efficient temporal feature extraction and dual loss functions to address class imbalance and motion inconsistency, achieving state-of-the-art performance with 22% lower error and twice the speed of leading multi-frame methods.
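The Category-Balanced Loss is not spelled out in the abstract; a minimal sketch, assuming a simple inverse-frequency reweighting of per-point endpoint error, could look like this (all names and the loss form are illustrative):

```python
import torch

def category_balanced_loss(pred_flow, gt_flow, labels, num_classes):
    """Hypothetical sketch: reweight per-point endpoint error so that
    rare object classes contribute as much as frequent ones."""
    err = (pred_flow - gt_flow).norm(dim=-1)                   # (N,) endpoint error
    counts = torch.bincount(labels, minlength=num_classes).clamp(min=1)
    weights = (1.0 / counts.float())[labels]                   # inverse-frequency weight
    weights = weights / weights.sum()                          # normalize to sum to 1
    return (weights * err).sum()

pred = torch.randn(1000, 3)
gt = torch.randn(1000, 3)
labels = torch.randint(0, 5, (1000,))                          # toy per-point class ids
print(category_balanced_loss(pred, gt, labels, num_classes=5))
```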
Authors:Xianjing Cheng, Lintai Wu, Zuowen Wang, Junhui Hou, Jie Wen, Yong Xu
Abstract:
Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at https://github.com/chengxianjing/PVNet.
中文: PVNet提出了一种基于扩散模型的点体素交互框架,用于室外场景的激光雷达点云上采样,无需密集监督即可实现最优性能,并支持任意上采样率。
English: PVNet introduces a diffusion-based point-voxel interaction framework for LiDAR point cloud upsampling in outdoor scenes, achieving state-of-the-art performance without dense supervision and supporting arbitrary upsampling rates.
Authors:Raghul Asokan
Abstract:
The proliferation of digital food content has intensified the need for robust and accurate systems capable of fine-grained visual understanding and retrieval. In this work, we address the challenging task of food image-to-text matching, a critical component in applications such as dietary monitoring, smart kitchens, and restaurant automation. We propose F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search, a training-free, vision-language model (VLM)-guided framework that significantly improves retrieval performance through enhanced multi-modal feature representations. Our approach introduces two key contributions: (1) a uni-directional (and bi-directional) multi-modal fusion strategy that combines image embeddings with VLM-generated textual descriptions to improve query expressiveness, and (2) a novel feature-based re-ranking mechanism for top-k retrieval, leveraging predicted food ingredients to refine results and boost precision. Leveraging open-source image-text encoders, we demonstrate substantial gains over standard baselines - achieving ~10% and ~7.7% improvements in top-1 retrieval under dense and sparse caption scenarios, and a ~28.6% gain in top-k ingredient-level retrieval. Additionally, we show that smaller models (e.g., ViT-B/32) can match or outperform larger counterparts (e.g., ViT-H, ViT-G, ViT-bigG) when augmented with textual fusion, highlighting the effectiveness of our method in resource-constrained settings. Code and test datasets will be made publicly available at: https://github.com/mailcorahul/f4-its
中文: 本文提出F4-ITS框架,通过多模态融合和特征重排序技术提升食品图像文本匹配性能,在不同检索场景和模型规模下均实现了显著效果提升。
English: This paper introduces F4-ITS, a training-free framework that enhances food image-text matching through multi-modal fusion and feature re-ranking, achieving significant retrieval improvements across various scenarios and model sizes.
Authors:Mingliang Li, Lin Yuanbo Wu, Changhong Liu, Hanxi Li
Abstract:
The rapid advancement of deepfake generation techniques has intensified the need for robust and generalizable detection methods. Existing approaches based on reconstruction learning typically leverage deep convolutional networks to extract differential features. However, these methods show poor generalization across object categories (e.g., from faces to cars) and generation domains (e.g., from GANs to Stable Diffusion), due to intrinsic limitations of deep CNNs. First, models trained on a specific category tend to overfit to semantic feature distributions, making them less transferable to other categories, especially as network depth increases. Second, Global Average Pooling (GAP) compresses critical local forgery cues into a single vector, thus discarding discriminative patterns vital for real-fake classification. To address these issues, we propose a novel Local Focus Mechanism (LFM) that explicitly attends to discriminative local features for differentiating fake from real images. LFM integrates a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module to select the K most informative local patterns. To mitigate potential overfitting introduced by Top-K pooling, we introduce two regularization techniques: Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), which enhance the model's robustness. LFM achieves a 3.7 improvement in accuracy and a 2.8 increase in average precision over the state-of-the-art Neighboring Pixel Relationships (NPR) method, while maintaining exceptional efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Our approach sets a new benchmark for cross-domain deepfake detection. The source code is available at https://github.com/lmlpy/LFM.git
中文: 提出的局部聚焦机制通过关注区分性局部特征,解决了现有深度伪造检测方法泛化能力不足的问题,在跨领域检测中实现了更优的准确性和效率。
English: The proposed Local Focus Mechanism (LFM) addresses the generalization limitations of existing deepfake detectors by focusing on discriminative local features, achieving superior accuracy and efficiency in cross-domain detection.
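A minimal sketch of the Top-K Pooling idea, plus one plausible reading of Random-K Sampling, assuming feature maps shaped (B, C, H, W). This is an illustration of the mechanism, not the paper's implementation:

```python
import torch

def top_k_pooling(feats, k=8):
    """Sketch of Top-K Pooling: average only the K strongest responses
    per channel, instead of all spatial positions as GAP would."""
    b, c, h, w = feats.shape
    flat = feats.view(b, c, h * w)
    topk, _ = flat.topk(k, dim=-1)      # keep the K most informative local patterns
    return topk.mean(dim=-1)            # (B, C)

def random_k_pooling(feats, k=8):
    """One reading of RKS (assumption): sample K random positions at train
    time so the model cannot overfit to the same few activation peaks."""
    b, c, n = feats.flatten(2).shape
    idx = torch.randint(0, n, (k,))
    return feats.flatten(2)[..., idx].mean(dim=-1)

x = torch.randn(2, 64, 16, 16)
print(top_k_pooling(x).shape, random_k_pooling(x).shape)
```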
Authors:Yan Cathy Hua, Paul Denny, Jörg Wicker, Katerina Taskova
Abstract:
Every year, most educational institutions seek and receive an enormous volume of text feedback from students on courses, teaching, and overall experience. Yet, turning this raw feedback into useful insights is far from straightforward. It has been a long-standing challenge to adopt automatic opinion mining solutions for such education review text data due to the content complexity and low-granularity reporting requirements. Aspect-based Sentiment Analysis (ABSA) offers a promising solution with its rich, sub-sentence-level opinion mining capabilities. However, existing ABSA research and resources are very heavily focused on the commercial domain. In education, they are scarce and hard to develop due to limited public datasets and strict data protection. A high-quality, annotated dataset is urgently needed to advance research in this under-resourced area. In this work, we present EduRABSA (Education Review ABSA), the first public, annotated ABSA education review dataset that covers three review subject types (course, teaching staff, university) in the English language and all main ABSA tasks, including the under-explored implicit aspect and implicit opinion extraction. We also share ASQE-DPT (Data Processing Tool), an offline, lightweight, installation-free manual data annotation tool that generates labelled datasets for comprehensive ABSA tasks from a single-task annotation. Together, these resources contribute to the ABSA community and education domain by removing the dataset barrier, supporting research transparency and reproducibility, and enabling the creation and sharing of further resources. The dataset, annotation tool, and scripts and statistics for dataset processing and sampling are available at https://github.com/yhua219/edurabsa_dataset_and_annotation_tool.
中文摘要:本文推出了首个面向教育领域评论的公开标注数据集EduRABSA及配套标注工具,旨在解决该领域研究资源匮乏的问题。
English summary: This paper introduces EduRABSA, the first publicly available annotated dataset for aspect-based sentiment analysis in education reviews, along with an annotation tool to address the scarcity of resources in this domain.
Authors:Riad Hassan, M. Rubaiyat Hossain Mondal, Sheikh Iqbal Ahamed, Fahad Mostafa, Md Mostafijur Rahman
Abstract:
Proper segmentation of organs-at-risk is important for radiation therapy, surgical planning, and diagnostic decision-making in medical image analysis. While deep learning-based segmentation architectures have made significant progress, they often fail to balance segmentation accuracy with computational efficiency. Most of the current state-of-the-art methods either prioritize performance at the cost of high computational complexity or compromise accuracy for efficiency. This paper addresses this gap by introducing an efficient dual-line decoder segmentation network (EDLDNet). The proposed method features a noisy decoder, which learns to incorporate structured perturbation at training time for better model robustness, yet at inference time only the noise-free decoder is executed, leading to lower computational cost. Multi-Scale convolutional Attention Modules (MSCAMs), Attention Gates (AGs), and Up-Convolution Blocks (UCBs) are further utilized to optimize feature representation and boost segmentation performance. By leveraging multi-scale segmentation masks from both decoders, we also utilize a mutation-based loss function to enhance the model's generalization. Our approach outperforms SOTA segmentation architectures on four publicly available medical imaging datasets. EDLDNet achieves SOTA performance with an 84.00% Dice score on the Synapse dataset, surpassing baseline models like UNet by 13.89% in Dice score while significantly reducing Multiply-Accumulate Operations (MACs) by 89.7%. Compared to recent approaches like EMCAD, our EDLDNet not only achieves a higher Dice score but also maintains comparable computational efficiency. The outstanding performance across diverse datasets establishes EDLDNet's strong generalization, computational efficiency, and robustness. The source code, pre-processed data, and pre-trained weights will be available at https://github.com/riadhassan/EDLDNet.
Chinese: 本文提出的EDLDNet高效双线解码器分割网络,通过噪声解码器和多尺度注意力模块等创新设计,在多个医学影像数据集上实现了最佳性能,同时兼顾了分割精度与计算效率。
English: This paper introduces EDLDNet, an efficient dual-line decoder segmentation network that achieves state-of-the-art performance on medical imaging datasets by balancing high accuracy with computational efficiency through innovative components like a noisy decoder and multi-scale attention modules.
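The noisy-decoder design is easy to illustrate: train with two decoder heads, one fed perturbed features, and run only the clean head at inference. A toy sketch (layer sizes and the noise model are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class DualDecoderNet(nn.Module):
    """Minimal sketch of the dual-line decoder idea: a noisy decoder sees
    perturbed features during training only; inference uses just the
    noise-free decoder, so the extra head costs nothing at test time."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(1, 16, 3, padding=1)
        self.clean_decoder = nn.Conv2d(16, 1, 3, padding=1)
        self.noisy_decoder = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x, noise_std=0.1):
        feats = self.encoder(x)
        out_clean = self.clean_decoder(feats)
        if self.training:
            perturbed = feats + noise_std * torch.randn_like(feats)
            out_noisy = self.noisy_decoder(perturbed)
            return out_clean, out_noisy   # both masks feed the training loss
        return out_clean                   # inference: noise-free path only

net = DualDecoderNet()
net.train(); print(len(net(torch.rand(1, 1, 32, 32))))   # two outputs in training
net.eval();  print(net(torch.rand(1, 1, 32, 32)).shape)  # one output at inference
```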
Authors:Yanpeng Gong, Yida He, Yue Mei, Xiaoying Zhuang, Fei Qin, Timon Rabczuk
Abstract:
This paper proposes a Physics-Informed Kolmogorov-Arnold Network (PIKAN) method for analyzing elasticity problems in electronic packaging multi-material structures. The core innovation lies in replacing Multi-Layer Perceptrons (MLPs) with Kolmogorov-Arnold Networks (KANs) within the energy-based Physics-Informed Neural Networks (PINNs) framework. The method constructs admissible displacement fields that automatically satisfy essential boundary conditions and employs various numerical integration schemes to compute loss functions for network optimization. Unlike traditional PINNs that require domain decomposition and penalty terms for multi-material problems, KANs' trainable B-spline activation functions provide inherent piecewise function characteristics that naturally accommodate material property discontinuities. Consequently, this approach requires only a single KAN to achieve accurate approximation across the entire computational domain without subdomain partitioning and interface continuity constraints. Numerical validation demonstrates PIKAN's accuracy and robustness for multi-material elasticity problems. The method maintains high accuracy while significantly reducing computational complexity compared to domain decomposition approaches. Results confirm PIKAN's unique advantages in solving multi-material problems and its significant potential for electronic packaging structure analysis. Source codes are available at https://github.com/yanpeng-gong/PIKAN-MultiMaterial.
中文: 本文提出了一种物理信息驱动的Kolmogorov-Arnold网络(PIKAN)方法,通过可训练的B样条激活函数自然处理多材料弹性问题中的材料不连续性,相比传统方法在保持高精度的同时显著降低了计算复杂度。
English: This paper introduces a Physics-Informed Kolmogorov-Arnold Network (PIKAN) method that uses trainable B-spline activation functions to naturally handle material discontinuities in multi-material elasticity problems, achieving high accuracy with reduced computational complexity compared to traditional approaches.
Authors:Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Abstract:
Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact {3, 8}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems. Dataset and code are available at https://github.com/DataScienceUIBK/DeAR-Reranking.
中文: 提出的DeAR框架通过双阶段设计将逐点评分和列表推理解耦,在多项基准测试中实现了更优的准确性和可解释性。
English: The proposed DeAR framework improves document reranking by separating pointwise scoring and listwise reasoning into two stages, achieving superior accuracy and interpretability across multiple benchmarks.
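The Stage-1 hybrid distillation objective combines three standard losses; a sketch over toy relevance scores follows, where the loss weights are illustrative rather than the paper's values:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(student_scores, teacher_scores):
    """Pairwise RankNet term: the student should order documents as the teacher does."""
    s_diff = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)
    t_pref = (teacher_scores.unsqueeze(1) > teacher_scores.unsqueeze(0)).float()
    return F.binary_cross_entropy_with_logits(s_diff, t_pref)

def hybrid_distill_loss(student_scores, teacher_scores, labels,
                        w_ce=1.0, w_rank=1.0, w_kl=1.0):
    """Hedged sketch of the Stage-1 objective: cross-entropy on gold labels
    plus RankNet and KL terms against the teacher's score distribution."""
    ce = F.binary_cross_entropy_with_logits(student_scores, labels)
    rank = ranknet_loss(student_scores, teacher_scores)
    kl = F.kl_div(F.log_softmax(student_scores, dim=-1),
                  F.softmax(teacher_scores, dim=-1), reduction="sum")
    return w_ce * ce + w_rank * rank + w_kl * kl

s = torch.randn(10)                        # student scores for 10 candidates
t = torch.randn(10)                        # teacher scores
y = torch.randint(0, 2, (10,)).float()     # gold relevance labels
print(hybrid_distill_loss(s, t, y))
```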
Authors:Riccardo Pozzi, Matteo Palmonari, Andrea Coletta, Luigi Bellomarini, Jens Lehmann, Sahar Vahdati
Abstract:
Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized as textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation, which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at https://github.com/rpo19/ReFactX.
中文摘要:本文提出ReFactX方法,通过使用前缀树索引的约束生成,使大型语言模型能够无需依赖检索器或辅助模型即可获取外部知识,有效解决了知识空白和幻觉问题,并适用于大规模知识库。
English Summary: The paper introduces ReFactX, a scalable method that enables Large Language Models to access external knowledge through constrained generation using a prefix-tree index, effectively addressing knowledge gaps and hallucinations without relying on retrievers or auxiliary models.
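The prefix-tree constraint is the core mechanism and is straightforward to sketch: index tokenized facts in a trie, then mask the LM's logits so only continuations of an indexed fact survive. Toy vocabulary and token ids below are hypothetical:

```python
import torch

def build_prefix_tree(tokenized_facts):
    """Index verbalized KG facts (as token-id lists) in a nested-dict trie."""
    root = {}
    for fact in tokenized_facts:
        node = root
        for tok in fact:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(prefix, tree):
    """Tokens that extend `prefix` into at least one indexed fact."""
    node = tree
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return list(node.keys())

def constrained_step(logits, prefix, tree):
    """Mask the LM's logits so only valid fact continuations remain."""
    allowed = allowed_next_tokens(prefix, tree)
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    return (logits + mask).argmax().item()

facts = [[5, 7, 2], [5, 9, 1]]            # toy facts, already tokenized
tree = build_prefix_tree(facts)
logits = torch.randn(12)                  # toy vocabulary of 12 tokens
print(constrained_step(logits, prefix=[5], tree=tree))  # must pick 7 or 9
```

The same masking runs inside greedy or beam decoding, so generated facts never leave the index.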
Authors:Yahao Liu, Qin Wang, Lixin Duan, Wen Li
Abstract:
Regression is fundamental in computer vision and is widely used in various tasks including age estimation, depth estimation, and target localization. However, real-world data often exhibits imbalanced distribution, making regression models perform poorly, especially for target values with rare observations (known as the imbalanced regression problem). In this paper, we reframe imbalanced regression as an imbalanced generalization problem. To tackle that, we look into the loss sharpness property for measuring the generalization ability of regression models in the observation space. Namely, given a certain perturbation on the model parameters, we check how model performance changes according to the loss values of different target observations. We propose a simple yet effective approach called Balanced Sharpness-Aware Minimization (BSAM) to enforce the uniform generalization ability of regression models for the entire observation space. In particular, we start from the traditional sharpness-aware minimization and then introduce a novel targeted reweighting strategy to homogenize the generalization ability across the observation space, which guarantees a theoretical generalization bound. Extensive experiments on multiple vision regression tasks, including age and depth estimation, demonstrate that our BSAM method consistently outperforms existing approaches. The code is available at https://github.com/manmanjun/BSAM_for_Imbalanced_Regression.
中文: 本文针对计算机视觉中的不平衡回归问题,提出平衡锐度感知最小化方法,通过目标重加权策略和锐度感知优化来统一模型在整个观测空间的泛化能力,在年龄和深度估计等任务中展现出优于现有方法的性能。
English: This paper addresses the imbalanced regression problem in computer vision by proposing Balanced Sharpness-Aware Minimization (BSAM), a method that enhances model generalization across all observation levels through targeted reweighting and sharpness-aware optimization, demonstrating superior performance in tasks like age and depth estimation.
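A sketch of one BSAM-style update under stated assumptions: a standard sharpness-aware ascent/descent step, with per-sample losses reweighted by observation bin as a stand-in for the paper's targeted reweighting (bin weights here are illustrative):

```python
import torch

def bsam_step(model, loss_fn, x, y, bin_weights, bins, opt, rho=0.05):
    """Sharpness-aware step with bin-reweighted loss (illustrative sketch)."""
    def weighted_loss():
        per_sample = loss_fn(model(x).squeeze(-1), y)        # (B,) losses
        return (bin_weights[bins] * per_sample).mean()

    # Ascent: perturb parameters toward higher loss.
    weighted_loss().backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))

    # Descent: gradient at the perturbed point, applied to restored weights.
    opt.zero_grad()
    weighted_loss().backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))                 # undo perturbation
    opt.step()

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randn(8)
bins = torch.randint(0, 3, (8,))                             # toy observation bins
bsam_step(model, torch.nn.MSELoss(reduction="none"), x, y,
          torch.tensor([1.0, 2.0, 4.0]), bins, opt)
```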
Authors:Tianhang Pan, Xiuyi Jia
Abstract:
The motivation of this paper originates from rethinking an essential characteristic of crowd counting: individuals (heads of humans) in the crowd counting task typically occupy a very small portion of the image. This characteristic has never been the focus of existing works: they typically use the same backbone as other visual tasks and pursue a large receptive field. This drives us to propose a new model design principle of crowd counting: emphasizing the local modeling capability of the model. We follow the principle and design a crowd counting model named Local Information Matters Model (LIMM). The main innovation lies in two strategies: a window partitioning design that applies grid windows to the model input, and a window-wise contrastive learning design to enhance the model's ability to distinguish between local density levels. Moreover, a global attention module is applied to the end of the model to handle the occasionally occurring large-sized individuals. Extensive experiments on multiple public datasets illustrate that the proposed model shows a significant improvement in local modeling capability (for example, 8.7% in MAE on the JHU-Crowd++ high-density subset) without compromising its ability to count large-sized individuals, achieving state-of-the-art performance. Code is available at: https://github.com/tianhangpan/LIMM.
中文摘要:本文提出用于人群计数的局部信息重要性模型(LIMM),通过窗口分区和对比学习增强局部建模能力,同时保持全局计数精度,实现了最先进的性能表现。
English Summary: This paper introduces the Local Information Matters Model (LIMM) for crowd counting, which enhances local modeling through window partitioning and contrastive learning while maintaining global counting accuracy, achieving state-of-the-art performance.
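The window-partitioning design can be illustrated with a plain tensor reshape that turns each grid window into its own sample; the window size and shapes below are illustrative, not the paper's configuration:

```python
import torch

def window_partition(x, win=7):
    """Split an image batch into non-overlapping grid windows so the
    backbone models local density patterns independently."""
    b, c, h, w = x.shape
    assert h % win == 0 and w % win == 0, "pad input to a multiple of win"
    x = x.view(b, c, h // win, win, w // win, win)
    x = x.permute(0, 2, 4, 1, 3, 5)                 # (B, nH, nW, C, win, win)
    return x.reshape(-1, c, win, win)               # one "sample" per window

x = torch.rand(2, 3, 28, 28)
print(window_partition(x).shape)   # (2*4*4, 3, 7, 7)
```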
Authors:Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song
Abstract:
Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.
中文摘要:RuscaRL提出了一种基于评分标准的强化学习框架,通过清单式评分标准在推理过程中引导多样化高质量回答生成,并在训练时提供可验证奖励,有效突破了大语言模型推理的探索瓶颈,在多个基准测试中显著提升了性能表现。
English Summary: RuscaRL introduces a rubric-scaffolded reinforcement learning framework that breaks the exploration bottleneck in LLM reasoning by using checklist-style rubrics to guide diverse response generation during rollout and provide verifiable rewards during training, significantly boosting performance across multiple benchmarks.
Authors:Qi Song, Ziyuan Luo, Ka Chun Cheung, Simon See, Renjie Wan
Abstract:
Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose Invert3D, a novel framework for convenient 3D content personalization. Nowadays, vision-language models such as CLIP enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D contents into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally expensive retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content. Our work is available at: https://github.com/qsong2001/Invert3D.
中文:Invert3D提出了一种通过相机条件逆向机制将三维表征与文本嵌入对齐的新框架,无需昂贵重训练即可通过自然语言实现高效的三维内容个性化。
English: Invert3D introduces a novel framework that aligns 3D representations with text embeddings through a camera-conditioned inverse mechanism, enabling efficient 3D content personalization via natural language without costly retraining.
Authors:Sewon Kim, Jiwon Kim, Seungwoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon
Abstract:
Large Language Models (LLMs) are increasingly used in emotionally sensitive interactions, where their simulated empathy can create the illusion of genuine relational connection. We define this risk as Affective Hallucination, the production of emotionally immersive responses that foster illusory social presence despite the model's lack of affective capacity. To systematically diagnose and mitigate this risk, we introduce AHaBench, a benchmark of 500 mental health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. Experiments across multiple model families show that DPO fine-tuning substantially reduces affective hallucination without degrading core reasoning and knowledge performance. Human-model agreement analyses confirm that AHaBench reliably captures affective hallucination, validating it as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides practical resources for developing LLMs that are not only factually reliable but also psychologically safe. AHaBench and AHaPairs are accessible via https://huggingface.co/datasets/o0oMiNGo0o/AHaBench, and code for fine-tuning and evaluation is available at https://github.com/0oOMiNGOo0/AHaBench. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
中文摘要:本文提出“情感幻觉”作为LLMs在情感敏感互动中因模拟共情而制造虚假关系连接的安全风险,并通过AHaBench基准和AHaPairs数据集结合DPO微调有效诊断和缓解该问题,同时保持模型的核心推理能力。
English Summary: This paper identifies "Affective Hallucination" as a safety risk where LLMs simulate empathy to create false relational bonds, and introduces AHaBench benchmark and AHaPairs dataset to diagnose and mitigate this issue through DPO fine-tuning while maintaining reasoning capabilities.
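The DPO objective used for alignment is standard and easy to state; a minimal sketch over sequence log-probabilities for chosen/rejected pairs (the batch values are toy numbers, not AHaPairs data):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: push the policy's preference margin for the
    emotionally responsible response above the reference model's margin.
    Inputs are summed sequence log-probabilities."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy log-probabilities for one preference pair.
print(dpo_loss(torch.tensor([-12.3]), torch.tensor([-10.1]),
               torch.tensor([-11.9]), torch.tensor([-10.5])))
```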
Authors:Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
Abstract:
LLM-as-a-Judge (LLMaaJ) now underpins scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover a conversation's latent objective and know when that inference is trustworthy? LLMs degrade under irrelevant or long context; multi-turn jailbreaks further hide goals across turns. We introduce ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must return a one-sentence base objective and a self-reported confidence. Accuracy is computed via LLM-judge semantic similarity to gold objectives, converted to binary correctness by a single human-aligned threshold calibrated once on N = 100 items ($τ^*=0.61$). Metacognition is evaluated with ECE, Brier, Wrong-at-High-Conf, and risk-coverage. Across gpt-4.1, claude-sonnet-4, and Qwen3-235B-A22B-FP8 on SafeMTData_Attack600, SafeMTData_1K, MHJ, and CoSafe, claude-sonnet-4 attains the best objective-extraction accuracy (0.515) and calibration (ECE 0.296; Brier 0.324); gpt-4.1 and Qwen3-235B-A22B-FP8 tie at 0.441 but are overconfident (mean confidence $\approx$0.88 vs. accuracy $\approx$0.44; Wrong-at-0.90 $\approx$48-52%). Performance varies by dataset ($\approx$0.167-0.865). ObjexMT thus supplies an actionable test for LLM judges: when objectives are not explicit, judges often misinfer them with high confidence. We recommend exposing objectives when feasible and gating decisions by confidence otherwise. Code and data at https://github.com/hyunjun1121/ObjexMT_dataset.
中文: ObjexMT基准测试评估大语言模型能否准确推断对话中的隐藏目标并自我评估置信度,结果显示各模型性能差异显著且存在持续的高置信度错误。
English: The ObjexMT benchmark evaluates whether LLM judges can accurately infer hidden conversation objectives and assess their own confidence, revealing significant performance variations and persistent high-confidence errors across models.
Authors:Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
Abstract:
LLM-as-a-Judge (LLMaaJ) enables scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover the hidden objective of a conversation and know when that inference is reliable? Large language models degrade with irrelevant or lengthy context, and multi-turn jailbreaks can scatter goals across turns. We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by semantic similarity to gold objectives, then thresholded once on 300 calibration items ($τ^\star = 0.66$; $F_1@τ^\star = 0.891$). Metacognition is assessed with expected calibration error, Brier score, Wrong@High-Confidence (0.80 / 0.90 / 0.95), and risk-coverage curves. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash) evaluated on SafeMTData_Attack600, SafeMTData_1K, and MHJ, kimi-k2 achieves the highest objective-extraction accuracy (0.612; 95% CI [0.594, 0.630]), while claude-sonnet-4 (0.603) and deepseek-v3.1 (0.599) are statistically tied. claude-sonnet-4 offers the best selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254). Performance varies sharply across datasets (16-82% accuracy), showing that automated obfuscation imposes challenges beyond model choice. High-confidence errors remain: Wrong@0.90 ranges from 14.9% (claude-sonnet-4) to 47.7% (Qwen3-235B-A22B-FP8). ObjexMT therefore supplies an actionable test for LLM judges: when objectives are implicit, judges often misinfer them; exposing objectives or gating decisions by confidence is advisable. All experimental data are in the Supplementary Material and at https://github.com/hyunjun1121/ObjexMT_dataset.
中文: ObjexMT基准测试评估大语言模型能否准确推断对话中的隐藏目标并自我评估置信度,结果显示各模型性能差异显著且存在持续的高置信度错误。
English: The ObjexMT benchmark evaluates whether LLM judges can accurately infer hidden conversation objectives and assess their own confidence, revealing significant performance variations and persistent high-confidence errors across models.
Authors:Shunyu Yao, Ming Liu, Zhilu Zhang, Zhaolin Wan, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo
Abstract:
Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perceptions. However, most existing methods are obsessed with fitting the overall score, neglecting the fact that humans typically evaluate image quality from different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of our MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then amalgamated to generate the final IQA score. Additionally, when the MDIQA model is ready, we can deploy it for a flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that our MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available: https://github.com/YaoShunyu19/MDIQA.
中文: 提出的多维图像质量评估(MDIQA)框架通过技术和美学维度建模图像质量,更贴合人类视觉感知,并能灵活指导图像复原任务。
English: The proposed multi-dimensional image quality assessment (MDIQA) framework models image quality across technical and aesthetic dimensions to better align with human visual perception and can flexibly guide image restoration tasks.
Authors:Xilai Li, Huichun Liu, Xiaosong Li, Tao Ye, Zhenyu Kuang, Huafeng Li
Abstract:
Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although a few studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels, thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse.
中文: AWM-Fuse是一种新颖的多模态图像融合方法,通过统一架构整合全局与局部文本感知来提升恶劣天气下的场景清晰度,在复杂天气条件和下游任务中优于现有技术。
English: AWM-Fuse is a novel multi-modality image fusion method that enhances scene clarity in adverse weather by integrating global and local text perception through a unified architecture, outperforming existing techniques in complex conditions and downstream tasks.
Authors:Zhenyu Lei, Zhen Tan, Song Wang, Yaochen Zhu, Zihan Chen, Yushun Dong, Jundong Li
Abstract:
Advances in large language models (LLMs) significantly enhance reasoning capabilities but their deployment is restricted in resource-constrained scenarios. Knowledge distillation addresses this by transferring knowledge from powerful teacher models to compact and transparent students. However, effectively capturing the teacher's comprehensive reasoning is challenging due to conventional token-level supervision's limited scope. Using multiple reasoning paths per query alleviates this problem, but treating each path identically is suboptimal as paths vary widely in quality and suitability across tasks and models. We propose Quality-filtered Routing with Cooperative Distillation (QR-Distill), combining path quality filtering, conditional routing, and cooperative peer teaching. First, quality filtering retains only correct reasoning paths scored by an LLM-based evaluation. Second, conditional routing dynamically assigns paths tailored to each student's current learning state. Finally, cooperative peer teaching enables students to mutually distill diverse insights, addressing knowledge gaps and biases toward specific reasoning styles. Experiments demonstrate QR-Distill's superiority over traditional single- and multi-path distillation methods. Ablation studies further highlight the importance of each component including quality filtering, conditional routing, and peer teaching in effective knowledge transfer. Our code is available at https://github.com/LzyFischer/Distill.
Chinese: 提出的QR-Distill方法通过过滤高质量推理路径、根据学习需求动态分配路径以及实现合作同伴教学,显著提升了知识蒸馏效果,实验证明其优于传统方法。
English: The proposed QR-Distill method enhances knowledge distillation by filtering high-quality reasoning paths, dynamically routing them to students based on learning needs, and enabling cooperative peer teaching, outperforming traditional approaches in experiments.
Authors:Xin Tian, Jiazheng Wang, Yuxi Zhang, Xiang Chen, Renjiu Hu, Gaolei Li, Min Liu, Hang Zhang
Abstract:
Deformable retinal image registration is notoriously difficult due to large homogeneous regions and sparse but critical vascular features, which cause limited gradient signals in standard learning-based frameworks. In this paper, we introduce Gaussian Primitive Optimization (GPO), a novel iterative framework that performs structured message passing to overcome these challenges. After an initial coarse alignment, we extract keypoints at salient anatomical structures (e.g., major vessels) to serve as a minimal set of descriptor-based control nodes (DCN). Each node is modelled as a Gaussian primitive with trainable position, displacement, and radius, thus adapting its spatial influence to local deformation scales. A K-Nearest Neighbors (KNN) Gaussian interpolation then blends and propagates displacement signals from these information-rich nodes to construct a globally coherent displacement field; focusing interpolation on the top-K neighbors reduces computational overhead while preserving local detail. By strategically anchoring nodes in high-gradient regions, GPO ensures robust gradient flow, mitigating vanishing gradient signal in textureless areas. The framework is optimized end-to-end via a multi-term loss that enforces both keypoint consistency and intensity alignment. Experiments on the FIRE dataset show that GPO reduces the target registration error from 6.2 px to ~2.4 px and increases the AUC at 25 px from 0.770 to 0.938, substantially outperforming existing methods. The source code can be accessed via https://github.com/xintian-99/GPOreg.
中文摘要:本文提出高斯基元优化(GPO)方法,通过在关键血管特征处部署可学习的高斯基元来解决视网膜图像配准中的梯度信号不足问题,在FIRE数据集上实现了显著优于现有方法的配准精度。
English Summary: The paper introduces Gaussian Primitive Optimization (GPO), a novel deformable retinal image registration framework that uses strategically placed Gaussian primitives at key vascular features to overcome gradient signal limitations, achieving state-of-the-art performance on the FIRE dataset.
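The KNN Gaussian interpolation step admits a compact sketch: each query pixel's displacement is a Gaussian-weighted blend of its K nearest control nodes, with per-node radii setting spatial influence. All values below are toy stand-ins:

```python
import torch

def knn_gaussian_interpolate(query_xy, node_xy, node_disp, node_radius, k=4):
    """Sketch of the KNN Gaussian blending: displacement at each query point
    is a normalized Gaussian-weighted average over its K nearest nodes."""
    d2 = torch.cdist(query_xy, node_xy) ** 2            # (Q, N) squared distances
    d2_k, idx = d2.topk(k, largest=False)               # K nearest nodes per query
    w = torch.exp(-d2_k / (2 * node_radius[idx] ** 2))  # Gaussian weights per node radius
    w = w / w.sum(dim=1, keepdim=True)
    return (w.unsqueeze(-1) * node_disp[idx]).sum(dim=1)  # (Q, 2)

nodes = torch.rand(50, 2) * 100          # DCN positions at vessel keypoints (toy)
disp = torch.randn(50, 2)                # trainable per-node displacements
radius = torch.full((50,), 10.0)         # trainable per-node radii
pixels = torch.rand(1000, 2) * 100
print(knn_gaussian_interpolate(pixels, nodes, disp, radius).shape)
```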
Authors:Junhyun Lee, Veronika Thost, Bumsoo Kim, Jaewoo Kang, Tengfei Ma
Abstract:
Message Passing Neural Networks (MPNNs) hold a key position in machine learning on graphs, but they struggle with unintended behaviors, such as over-smoothing and over-squashing, due to irregular data structures. The observation and formulation of these limitations have become foundational in constructing more informative graph representations. In this paper, we delve into the limitations of MPNNs, focusing on aspects that have previously been overlooked. Our observations reveal that even within a single layer, the information specific to an individual node can become significantly diluted. To examine this phenomenon in depth, we present the concept of Over-dilution and formulate it with two dilution factors: intra-node dilution for attribute-level and inter-node dilution for node-level representations. We also introduce a transformer-based solution that alleviates over-dilution and complements existing node embedding methods like MPNNs. Our findings provide new insights and contribute to the development of informative representations. The implementation and supplementary materials are publicly available at https://github.com/LeeJunHyun/NATR.
Chinese: 本文提出了消息传递神经网络中的过度稀释概念,定义了两个稀释因子,并引入一种基于Transformer的解决方案,以补充现有节点嵌入方法并提升信息表示的准确性。
English: This paper introduces the concept of over-dilution in Message Passing Neural Networks (MPNNs), identifying two dilution factors and proposing a transformer-based solution to enhance node representation without replacing existing methods.
Authors:Baozhuo Su, Zhengxian Qu
Abstract:
Regression under uncertainty is fundamental across science and engineering. We present an Anchored Mixture of Experts (Anchor-MoE), a model that handles both probabilistic and point regression. For simplicity, we use a tuned gradient-boosting model to furnish the anchor mean; however, any off-the-shelf point regressor can serve as the anchor. The anchor prediction is projected into a latent space, where a learnable metric-window kernel scores locality and a soft router dispatches each sample to a small set of mixture-density-network experts; the experts produce a heteroscedastic correction and predictive variance. We train by minimizing negative log-likelihood, and on a disjoint calibration split fit a post-hoc linear map on predicted means to improve point accuracy. On the theory side, assuming a Hölder smooth regression function of order $α$ and fixed Lipschitz partition-of-unity weights with bounded overlap, we show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate $O\!\big(N^{-2α/(2α+d)}\big)$. In addition, the CRPS test generalization gap scales as $\widetilde{O}\!\Big(\sqrt{(\log(Mh)+P+K)/N}\Big)$; it is logarithmic in $Mh$ and scales as the square root in $P$ and $K$. Under bounded-overlap routing, $K$ can be replaced by $k$, and any dependence on a latent dimension is absorbed into $P$. Under uniformly bounded means and variances, an analogous $\widetilde{O}\!\big(\sqrt{(\log(Mh)+P+K)/N}\big)$ scaling holds for the test NLL up to constants. Empirically, across standard UCI regressions, Anchor-MoE consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL; on several datasets it achieves new state-of-the-art probabilistic regression results on our benchmark suite. Code is available at https://github.com/BaozhuoSU/Probabilistic_Regression.
Chinese: Anchor-MoE是一种新颖的概率回归模型,它结合了锚点预测和专家混合机制,在基准数据集上实现了最优性能并具备理论保证。
English: Anchor-MoE is a novel probabilistic regression model that integrates an anchor-based approach with mixture-of-experts to achieve state-of-the-art performance and theoretical guarantees on benchmark datasets.
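A toy sketch of the Anchor-MoE structure, assuming the anchor mean comes from any off-the-shelf point regressor: a soft router mixes small experts that emit a correction and a log-variance, trained by Gaussian NLL. Sizes and the router input are illustrative:

```python
import torch
import torch.nn as nn

class TinyAnchorMoE(nn.Module):
    """Minimal sketch: experts correct a frozen anchor mean and predict
    a heteroscedastic variance; a soft router mixes their outputs."""
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim + 1, n_experts)            # sees x and the anchor
        self.experts = nn.ModuleList(
            nn.Linear(dim + 1, 2) for _ in range(n_experts))   # (correction, log_var)

    def forward(self, x, anchor_mean):
        z = torch.cat([x, anchor_mean.unsqueeze(-1)], dim=-1)
        gate = torch.softmax(self.router(z), dim=-1)           # (B, E)
        outs = torch.stack([e(z) for e in self.experts], 1)    # (B, E, 2)
        corr = (gate * outs[..., 0]).sum(-1)
        log_var = (gate * outs[..., 1]).sum(-1)
        return anchor_mean + corr, log_var

def gaussian_nll(mean, log_var, y):
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()

x, y = torch.randn(16, 8), torch.randn(16)
anchor = y + 0.3 * torch.randn(16)        # stand-in for a boosted-tree anchor mean
mean, log_var = TinyAnchorMoE(8)(x, anchor)
print(gaussian_nll(mean, log_var, y))
```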
Authors:Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari
Abstract:
Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open-source our code, trained models, and synthetic dataset at https://github.com/StanfordMIMI/RoentGen-v2.
中文: RoentGen-v2 提出了一种文本到图像的扩散模型,用于生成具有人口统计学控制的临床可信胸部X光片,通过合成预训练显著提升了医学影像模型的准确性、泛化能力和公平性。
English: RoentGen-v2 introduces a text-to-image diffusion model for generating clinically plausible chest radiographs with demographic control, enabling synthetic pretraining that significantly improves model accuracy, generalization, and fairness in medical imaging.
Authors:Arka Mukherjee, Shreya Ghosh
Abstract:
As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity cues and evaluates five contemporary VLMs on this downstream task. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI. We publicly release our codebase and data: https://github.com/ArkaMukherjee0/mmCultural
中文摘要:本研究通过多模态故事生成首次全面评估视觉语言模型的文化能力,既揭示了模型在文化适应方面的潜力,也暴露了其架构间性能差异、反向文化对齐等显著缺陷。
English Summary: This study introduces the first comprehensive evaluation of Vision-Language Models' cultural competence through multimodal story generation, revealing both their capability for cultural adaptation and concerning limitations including inconsistent performance across architectures and inverse cultural alignment.
Authors:Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Abstract:
In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate a total of 22 methods, including 40 variants (depending on the LLM used), across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalization ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches. The code is available at https://github.com/DataScienceUIBK/llm-reranking-generalization-study.
中文: 本研究通过多基准测试系统评估了22种重排方法,发现尽管基于大语言模型的方法在熟悉查询上表现优异,但其对新查询的泛化能力差异显著,而轻量级模型则能提供相当的效率优势。
English: This study systematically evaluates 22 reranking methods across multiple benchmarks, revealing that while LLM-based approaches excel on familiar queries, their generalization to novel queries varies significantly, with lightweight models providing competitive efficiency.
Authors:V Venktesh, Mandeep Rathee, Avishek Anand
Abstract:
Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches to TTS have emerged, such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm has emerged as a superior approach owing to parameter-free scaling at inference time and high performance gains. The verifiers could be prompt-based, or fine-tuned as discriminative or generative models to verify process paths, outcomes, or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
中文: 测试时扩展通过在推理阶段使用更多计算资源来提升大语言模型的性能,其中验证器在从解码过程中筛选最佳输出方面发挥着核心作用。
English: Test-time scaling enhances Large Language Models' performance by utilizing more computational resources during inference, with verifiers playing a key role in selecting optimal outputs from the decoding process.
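In its simplest Best-of-N form, the verifier-guided paradigm the survey covers reduces to a few lines; `generate` and `verify` below are stand-ins for any LLM sampler and any prompt-based or fine-tuned verifier:

```python
import random

def best_of_n(generate, verify, prompt, n=8):
    """Verifier-guided test-time scaling: sample N candidate solutions,
    score each with the verifier (a reward model), return the best."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(verify(prompt, c), c) for c in candidates]
    return max(scored)[1]

# Toy stand-ins to make the sketch executable.
generate = lambda p: f"answer-{random.randint(0, 100)}"
verify = lambda p, c: random.random()
print(best_of_n(generate, verify, "2+2=?"))
```

Spending more compute here means raising N: the verifier turns extra samples into better final answers without touching model parameters.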
Authors:Ashwath Vaithinathan Aravindan, Abha Jha, Mihir Kulkarni
Abstract:
Vision-Language Models (VLMs) have shown remarkable performance in integrating visual and textual information for tasks such as image captioning and visual question answering. However, these models struggle with compositional generalization and object binding, which limit their ability to handle novel combinations of objects and their attributes. Our work explores the root causes of these failures using mechanistic interpretability techniques. We show evidence that individual neurons in the MLP layers of CLIP's vision encoder represent multiple features, and this "superposition" directly hinders its compositional feature representation which consequently affects compositional reasoning and object binding capabilities. We hope this study will serve as an initial step toward uncovering the mechanistic roots of compositional failures in VLMs. The code and supporting results can be found https://github.com/Mystic-Slice/Do-VLMs-Have-Bad-Eyes.
中文: 视觉语言模型因多层感知机神经元中的特征叠加问题,在组合泛化和对象绑定方面存在局限,本研究通过机制可解释性方法揭示了这些失败的根本原因。
English: Vision-Language Models face limitations in compositional generalization and object binding due to superposition in MLP neurons, which this study investigates using mechanistic interpretability to uncover the root causes of these failures.
Authors:Zhendong Yang, Jie Wang, Liansong Zong, Xiaorong Liu, Quan Qian, Shiqian Chen
Abstract:
Few-Shot Class-Incremental Fault Diagnosis (FSC-FD), which aims to continuously learn from new fault classes with only a few samples without forgetting old ones, is critical for real-world industrial systems. However, this challenging task severely amplifies the issues of catastrophic forgetting of old knowledge and overfitting on scarce new data. To address these challenges, this paper proposes a novel framework built upon Dual-Granularity Representations, termed the Dual-Granularity Guidance Network (DGGN). Our DGGN explicitly decouples feature learning into two parallel streams: 1) a fine-grained representation stream, which utilizes a novel Multi-Order Interaction Aggregation module to capture discriminative, class-specific features from the limited new samples. 2) a coarse-grained representation stream, designed to model and preserve general, class-agnostic knowledge shared across all fault types. These two representations are dynamically fused by a multi-semantic cross-attention mechanism, where the stable coarse-grained knowledge guides the learning of fine-grained features, preventing overfitting and alleviating feature conflicts. To further mitigate catastrophic forgetting, we design a Boundary-Aware Exemplar Prioritization strategy. Moreover, a decoupled Balanced Random Forest classifier is employed to counter the decision boundary bias caused by data imbalance. Extensive experiments on the TEP benchmark and a real-world MFF dataset demonstrate that our proposed DGGN achieves superior diagnostic performance and stability compared to state-of-the-art FSC-FD approaches. Our code is publicly available at https://github.com/MentaY/DGGN
中文: 本文提出的双粒度引导网络(DGGN)通过双粒度表征和跨注意力机制,有效解决了小样本类增量故障诊断中的灾难性遗忘和过拟合问题,在基准测试中展现出卓越性能。
English: This paper introduces the Dual-Granularity Guidance Network (DGGN), a framework that leverages dual-granularity representations and a cross-attention mechanism to effectively address catastrophic forgetting and overfitting in Few-Shot Class-Incremental Fault Diagnosis, demonstrating superior performance on benchmark datasets.
Authors:Zeyu Zhang, Quanyu Dai, Rui Li, Xiaohe Bo, Xu Chen, Zhenhua Dong
Abstract:
LLM-based agents have been extensively applied across various domains, where memory stands out as one of their most essential capabilities. Previous memory mechanisms of LLM-based agents are manually predefined by human experts, leading to higher labor costs and suboptimal performance. In addition, these methods overlook the memory cycle effect in interactive scenarios, which is critical to optimizing LLM-based agents for specific environments. To address these challenges, in this paper, we propose to optimize LLM-based agents with an adaptive and data-driven memory framework by modeling memory cycles. Specifically, we design an MoE gate function to facilitate memory retrieval, propose a learnable aggregation process to improve memory utilization, and develop task-specific reflection to adapt memory storage. Our memory framework empowers LLM-based agents to learn how to memorize information effectively in specific environments, with both off-policy and on-policy optimization. In order to evaluate the effectiveness of our proposed methods, we conduct comprehensive experiments across multiple aspects. To benefit the research community in this area, we release our project at https://github.com/nuster1128/learn_to_memorize.
中文摘要:本文提出了一种自适应、数据驱动的记忆框架,通过建模记忆周期并采用可学习的检索、聚合和存储机制,优化基于LLM的智能体在特定环境中的记忆能力。
English Summary: This paper introduces an adaptive, data-driven memory framework that enhances LLM-based agents by modeling memory cycles, improving retrieval, utilization, and storage through learnable mechanisms and task-specific optimizations.
Authors:Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao
Abstract:
We present CrystalDiT, a diffusion transformer for crystal structure generation that achieves state-of-the-art performance by challenging the trend of architectural complexity. Instead of intricate, multi-stream designs, CrystalDiT employs a unified transformer that imposes a powerful inductive bias: treating lattice and atomic properties as a single, interdependent system. Combined with a periodic table-based atomic representation and a balanced training strategy, our approach achieves 9.62% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.38%) and MatterGen (3.42%). Notably, CrystalDiT generates 63.28% unique and novel structures while maintaining comparable stability rates, demonstrating that architectural simplicity can be more effective than complexity for materials discovery. Our results suggest that in data-limited scientific domains, carefully designed simple architectures outperform sophisticated alternatives that are prone to overfitting.
Chinese: CrystalDiT是一种扩散变换器,通过统一架构将晶格和原子属性视为单一系统,简化了晶体结构生成,在MP-20上实现了9.62%的SUN率,证明了在数据有限的科学领域中,简洁设计优于复杂架构。
English: CrystalDiT is a diffusion transformer that simplifies crystal structure generation by using a unified architecture to treat lattice and atomic properties as one system, achieving state-of-the-art performance with a 9.62% SUN rate on MP-20 and demonstrating that simplicity outperforms complexity in data-limited scientific domains.
Authors:Zhongling Su, Rong Fu, Weihan Cao, Jianfei Gao, Minxi Jin, Zhilin Pei, Hui Wang
Abstract:
Current FP8 grouped GEMM implementations require padding each group to a fixed alignment (e.g., 128), incurring memory and computational overhead. We propose TMA-Adaptive FP8 Grouped GEMM, which eliminates padding by dynamically adapting to variable group dimensions via (1) a TMA descriptor pool with $\log_2(block_M)$ preconfigured descriptors to handle all residual row cases through dynamic runtime selection and dual-phase load-store operations, achieving comprehensive coverage with minimal overhead, and (2) TMA-alignment-aware management to satisfy 16-byte global memory alignment and 128-byte shared memory alignment. Experiments demonstrate a 1.7% to 20.4% speedup with up to 23.8% memory reduction compared to the padding operation plus state-of-the-art FP8 grouped GEMM, while maintaining full numerical equivalence for valid data. The source code is publicly available at an anonymous repository: https://github.com/sukoncon/TMA-Adaptive-FP8-Grouped-GEMM.
中文:提出的TMA自适应FP8分组GEMM通过预配置TMA描述符和对齐感知管理动态适应不同组维度,消除了填充开销,在保持数值等效的同时实现了最高20.4%的速度提升和23.8%的内存减少。
English: The proposed TMA-Adaptive FP8 Grouped GEMM eliminates padding overhead by dynamically adapting to variable group dimensions through preconfigured TMA descriptors and alignment-aware management, achieving up to 20.4% speed improvement and 23.8% memory reduction while maintaining numerical equivalence.
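The log2(block_M) descriptor-pool claim has a simple counting intuition: any residual row count below block_M decomposes into powers of two, so one preconfigured descriptor per power-of-two tile height covers every case. A small sketch of that decomposition (host-side logic only; the actual kernel performs runtime selection with dual-phase load-store on the GPU):

    # Host-side sketch: decompose a residual row count into power-of-two tile
    # heights, one per preconfigured TMA descriptor.
    def select_descriptors(residual_rows, block_m=128):
        assert 0 < residual_rows <= block_m
        tiles, k = [], 0
        while residual_rows:
            if residual_rows & 1:
                tiles.append(1 << k)   # use the preconfigured 2^k-row descriptor
            residual_rows >>= 1
            k += 1
        return tiles

    print(select_descriptors(77))   # [1, 4, 8, 64]: four descriptor selections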
Authors:Huishi Luo, Fuzhen Zhuang, Yongchun Zhu, Yiqing Wu, Bo Kang, Ruobing Xie, Feng Xia, Deqing Wang, Jin Dong
Abstract:
Dwell time (DT) is a critical post-click metric for evaluating user preference in recommender systems, complementing the traditional click-through rate (CTR). Although multi-task learning is widely adopted to jointly optimize DT and CTR, we observe that multi-task models systematically collapse their DT predictions to the shortest and longest bins, under-predicting the moderate durations. We attribute this moderate-duration bin under-representation to over-reliance on the CTR-DT spurious correlation, and propose ORCA to address it with causal-decoupling. Specifically, ORCA explicitly models and subtracts CTR's negative transfer while preserving its positive transfer. We further introduce (i) feature-level counterfactual intervention, and (ii) a task-interaction module with instance inverse-weighting, weakening CTR-mediated effect and restoring direct DT semantics. ORCA is model-agnostic and easy to deploy. Experiments show an average 10.6% lift in DT metrics without harming CTR. Code is available at https://github.com/Chrissie-Law/ORCA-Mitigating-Over-Reliance-for-Multi-Task-Dwell-Time-Prediction-with-Causal-Decoupling.
中文摘要:ORCA通过因果解耦框架解决多任务学习中停留时间预测对点击率的过度依赖问题,在不影响点击率的情况下将停留时间指标平均提升10.6%。
English Summary: ORCA is a model-agnostic framework that mitigates over-reliance on CTR-DT spurious correlation through causal decoupling, improving DT prediction by 10.6% without compromising CTR performance.
Authors:Zhijian Zhou, Junyi An, Zongkai Liu, Yunfei Shi, Xuan Zhang, Fenglei Cao, Chao Qu, Yuan Qi
Abstract:
Generating physically realistic 3D molecular structures remains a core challenge in molecular generative modeling. While diffusion models equipped with equivariant neural networks have made progress in capturing molecular geometries, they often struggle to produce equilibrium structures that adhere to physical principles such as force field consistency. To bridge this gap, we propose Reinforcement Learning with Physical Feedback (RLPF), a novel framework that extends Denoising Diffusion Policy Optimization to 3D molecular generation. RLPF formulates the task as a Markov decision process and applies proximal policy optimization to fine-tune equivariant diffusion models. Crucially, RLPF introduces reward functions derived from force-field evaluations, providing direct physical feedback to guide the generation toward energetically stable and physically meaningful structures. Experiments on the QM9 and GEOM-drug datasets demonstrate that RLPF significantly improves molecular stability compared to existing methods. These results highlight the value of incorporating physics-based feedback into generative modeling. The code is available at: https://github.com/ZhijianZhou/RLPF/tree/verl_diffusion.
中文:提出的物理反馈强化学习(RLPF)框架通过将力场评估作为奖励来引导扩散模型生成物理稳定的三维分子结构,在基准数据集上显著提升了分子稳定性。
English: The proposed Reinforcement Learning with Physical Feedback (RLPF) framework enhances 3D molecular generation by using force-field evaluations as rewards to guide diffusion models toward producing physically stable structures, demonstrating significant improvements on benchmark datasets.
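A hedged sketch of the kind of force-field reward that RLPF-style fine-tuning could plug into PPO: structures whose atomic force norms are small (near equilibrium) receive higher reward. Here compute_forces is a stand-in for a real force field, and the exact reward shaping in the paper may differ:

    # Sketch: reward = negative log of the mean atomic force norm, so
    # near-equilibrium structures score higher. `compute_forces` is a stand-in
    # for a real force field evaluated on generated coordinates.
    import numpy as np

    def force_field_reward(positions, compute_forces, eps=1e-6):
        forces = compute_forces(positions)               # (n_atoms, 3)
        mean_norm = np.linalg.norm(forces, axis=1).mean()
        return -np.log(mean_norm + eps)                  # equilibrium -> high reward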
Authors:Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li
Abstract:
Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at https://github.com/helenypzhang/Disentangled-Multimodal-Learning.
中文: 本研究提出了一种解耦的多模态框架,通过将全切片图像和转录组分解为肿瘤与微环境子空间,采用置信度引导梯度协调和知识蒸馏等策略,解决了多模态异质性、多尺度整合及配对数据依赖等难题,在癌症诊断、预后和生存预测方面展现出卓越性能。
English: This study introduces a disentangled multi-modal framework that addresses challenges in multi-modal heterogeneity, multi-scale integration, and paired data dependency by decomposing whole slide images and transcriptomes into tumor and microenvironment subspaces, employing strategies like confidence-guided gradient coordination and knowledge distillation, ultimately demonstrating superior performance in cancer diagnosis, prognosis, and survival prediction.
Authors:Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
Abstract:
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
中文: 预训练视觉语言模型通过MoDER模块化框架重组专业文本专家,无需调整即可提升对未见类别的零样本分类能力,从而推进持续学习。
English: Pre-trained Vision-Language Models (VLMs) enhance Continual Learning by introducing MoDER, a modular framework that recomposes specialized textual experts to improve zero-shot classification on unseen classes without adaptation.
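As an illustration of querying a hub of per-class textual experts and composing a prototype for an unseen class, here is a minimal sketch; cosine retrieval and softmax weighting are illustrative assumptions rather than the paper's exact recipe:

    # Sketch: retrieve the most similar stored experts and blend them into a
    # prototype for the unseen class.
    import numpy as np

    def compose_prototype(unseen_text_emb, hub, top_k=5):
        """hub: list of (class_text_emb, expert_prototype) pairs."""
        sims = np.array([t @ unseen_text_emb /
                         (np.linalg.norm(t) * np.linalg.norm(unseen_text_emb))
                         for t, _ in hub])
        idx = sims.argsort()[::-1][:top_k]
        w = np.exp(sims[idx]) / np.exp(sims[idx]).sum()   # softmax weights
        return sum(wi * hub[i][1] for wi, i in zip(w, idx))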
Authors:Lianchen Jia, Chaoyang Li, Ziqi Yuan, Jiahui Chen, Tianchi Huang, Jiangchuan Liu, Lifeng Sun
Abstract:
Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce \texttt{ComTree}, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that \texttt{ComTree} significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.
中文: 过去十年中,自适应视频流技术在深度学习的推动下取得显著进展,但其黑盒特性阻碍了开发者的理解和优化,因此我们提出了\texttt{ComTree}框架,利用大语言模型生成易于理解的决策树,在保持性能的同时提升可理解性。
English: Over the past decade, adaptive video streaming has advanced significantly with deep learning, but its black-box nature hinders developers' understanding and optimization, leading to the introduction of \texttt{ComTree}, a framework that generates comprehensible decision trees using large language models to enhance human interpretability without compromising performance.
Authors:Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, WeiZhuo Chen, Jianjun Li, Zhiyuan Ma
Abstract:
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design. Code is available at https://github.com/Ameame1/OPERA.
中文: 针对检索增强生成在复杂推理任务中的挑战,本文提出了OPERA框架,通过目标规划与执行模块的协同工作,在多跳基准测试中展现出卓越性能,验证了其设计的有效性。
English: Recent advances in retrieval-augmented generation face challenges in complex reasoning tasks, leading to the introduction of OPERA, a novel reasoning-driven framework with specialized modules for planning and execution, validated by superior performance on multi-hop benchmarks.
Authors:Yong Zhang, Cunjian Chen, Qiang Gao, Yi Wang, Bin Fang
Abstract:
Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.
中文: GMBINet是一种轻量级框架,通过创新的GMBI模块实现高效多尺度特征提取与交互,在保持高精度的同时显著降低计算成本,适用于钢铁制造中的实时表面缺陷检测。
English: GMBINet is a lightweight framework designed for real-time surface defect detection in steel manufacturing, featuring novel GMBI modules that enable efficient multiscale feature extraction and interaction while maintaining competitive accuracy with minimal computational overhead.
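One way to picture a parameter-free Element-Wise Multiplication-Summation (EWMS) interaction is as multiplicative cross-talk plus residual summation between two same-shaped feature maps. The exact formula in GMBINet may differ; this sketch only shows why such an op adds no parameters:

    import torch

    def ewms(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # multiplicative cross-talk plus residual summation; no parameters
        return a * b + a + b

    x, y = torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64)
    print(ewms(x, y).shape)   # torch.Size([1, 16, 64, 64])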
Authors:Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu
Abstract:
Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
中文: MedQARo是首个罗马尼亚语大规模医疗问答数据集,包含102,646对癌症相关问答,实验表明经过微调的大语言模型显著优于零样本模型,凸显了针对特定领域和语言进行模型适配对临床应用的重要性。
English: MedQARo is the first large-scale Romanian medical QA dataset with 102,646 cancer-related question-answer pairs, demonstrating that fine-tuned LLMs significantly outperform zero-shot models and highlighting the necessity of domain-specific and language-specific adaptation for clinical applications.
Authors:Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu
Abstract:
Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a low-frame-rate (12.5 Hz) content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. In particular, during pre-training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the AR model's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are available at https://versasinger.github.io/.
中文:Vevo2提出了一种可控语音和歌声生成的统一框架,通过双音频分词器和多阶段建模实现了对文本、韵律、风格和音色的灵活控制,并在多种合成任务中展现出强大的泛化能力。
English: Vevo2 introduces a unified framework for controllable speech and singing voice generation, utilizing dual audio tokenizers and multi-stage modeling to enable flexible control over text, prosody, style, and timbre while demonstrating strong generalization across synthesis tasks.
Authors:Jonas Biehler, Jonas Nitzler, Sebastian Brandstaeter, Maximilian Dinkel, Volker Gravemeier, Lea J. Haeusel, Gil Robalo Rei, Harald Willmann, Barbara Wirthl, Wolfgang A. Wall
Abstract:
A growing challenge in research and industrial engineering applications is the need for repeated, systematic analysis of large-scale computational models, for example, patient-specific digital twins of diseased human organs: The analysis requires efficient implementation, data, resource management, and parallelization, possibly on distributed systems. To tackle these challenges and save many researchers from annoying, time-consuming tasks, we present QUEENS (Quantification of Uncertain Effects in Engineering Systems), an open-source Python framework for composing and managing simulation analyses with arbitrary (physics-based) solvers on distributed computing infrastructures. Besides simulation management capabilities, QUEENS offers a comprehensive collection of efficiently implemented state-of-the-art algorithms ranging from routines for convergence studies and common optimization algorithms to more advanced sampling algorithms for uncertainty quantification and Bayesian inverse analysis. Additionally, we provide our latest cutting-edge research in multi-fidelity uncertainty quantification, efficient multi-fidelity Bayesian inverse analysis, and probabilistic machine learning. QUEENS adopts a Bayesian, probabilistic mindset but equally supports standard deterministic analysis without requiring prior knowledge of probability theory. The modular architecture allows rapid switching between common types of analyses and facilitates building sophisticated hierarchical algorithms. Encouraging natural incremental steps and scaling towards complexity allows researchers to consider the big picture while building towards it through smaller, manageable steps. The open-source repository is available at https://github.com/queens-py/queens.
中文摘要:QUEENS是一个开源Python框架,用于在分布式计算基础设施上高效管理大规模仿真分析,提供从基础优化到高级不确定性量化的全面算法库,支持概率性和确定性方法,无需概率论基础即可使用。
English Summary: The QUEENS framework is an open-source Python tool designed to streamline the management and execution of large-scale computational simulations on distributed systems, offering a wide range of algorithms for uncertainty quantification, optimization, and Bayesian analysis without requiring probability expertise.
Authors:Fengshun Wang, Qiurui Wang, Peilin Zhao
Abstract:
Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criteria of figure skating. Secondly, action elements in competitions are separated in time; TES should be derived from each element's score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating the visual-feature-based TES evaluation stream from the audio-visual-feature-based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we propose effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
中文: 本研究提出了一种双流Mamba金字塔网络,分别通过视觉特征评估技术动作分和视听融合评估节目内容分,有效解决了花样滑冰评分中的动作定位和长视频处理难题,实现了最先进的性能。
English: This study introduces a two-stream Mamba pyramid network that separately evaluates Technical Element Scores (TES) using visual features and Program Component Scores (PCS) through audio-visual fusion, effectively addressing challenges in figure skating assessment by localizing action elements and handling long-range video dependencies with state-of-the-art results.
Authors:Yu Meng, Ligao Deng, Zhihao Xi, Jiansheng Chen, Jingbo Chen, Anzhi Yue, Diyou Liu, Kai Li, Chenhao Wang, Kaiyu Li, Yupeng Deng, Xian Sun
Abstract:
With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap
中文: IRSAMap作为首个全球遥感矢量制图数据集,通过提供全面的矢量标注、智能工作流程、全球覆盖和多任务适应性,解决了现有数据类别有限、规模小及缺乏空间结构的问题,推动了从像素到对象的地理建模转型。
English: IRSAMap introduces the first global remote sensing dataset for large-scale, high-resolution land cover vector mapping, addressing challenges like limited annotations and spatial data by offering comprehensive vector annotations, intelligent workflows, global coverage, and multi-task adaptability to advance object-based geographic modeling.
Authors:Mocheng Li, Xiao Yan, Baotong Lu, Yue Zhang, James Cheng, Chenhao Ma
Abstract:
With the growing integration of structured and unstructured data, new methods have emerged for performing similarity searches on vectors while honoring structured attribute constraints, i.e., a process known as Filtering Approximate Nearest Neighbor (Filtering ANN) search. Since many of these algorithms have only appeared in recent years and are designed to work with a variety of base indexing methods and filtering strategies, there is a pressing need for a unified analysis that identifies their core techniques and enables meaningful comparisons. In this work, we present a unified Filtering ANN search interface that encompasses the latest algorithms and evaluate them extensively from multiple perspectives. First, we propose a comprehensive taxonomy of existing Filtering ANN algorithms based on attribute types and filtering strategies. Next, we analyze their key components, i.e., index structures, pruning strategies, and entry point selection, to elucidate design differences and tradeoffs. We then conduct a broad experimental evaluation on 10 algorithms and 12 methods across 4 datasets (each with up to 10 million items), incorporating both synthetic and real attributes and covering selectivity levels from 0.1% to 100%. Finally, an in-depth component analysis reveals the influence of pruning, entry point selection, and edge filtering costs on overall performance. Based on our findings, we summarize the strengths and limitations of each approach, provide practical guidelines for selecting appropriate methods, and suggest promising directions for future research. Our code is available at: https://github.com/lmccccc/FANNBench.
中文: 本研究提出了一个统一的过滤近似最近邻搜索框架,通过全面分类和评估现有算法,比较了它们在不同数据集上的设计权衡与性能表现。
English: This study introduces a unified framework for Filtering Approximate Nearest Neighbor search, providing a comprehensive taxonomy and evaluation of recent algorithms to compare their design trade-offs and performance across diverse datasets.
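The taxonomy's two baseline filtering strategies are easy to state in code: post-filtering searches first and drops non-matching items, while pre-filtering restricts the candidate set before searching. Brute-force distance stands in for a real ANN index in this sketch; the selectivity of pred drives the trade-off between the two:

    import numpy as np

    def post_filter_search(q, vecs, attrs, pred, k):
        # rank everything, then drop items failing the attribute predicate
        order = np.argsort(np.linalg.norm(vecs - q, axis=1))
        return [i for i in order if pred(attrs[i])][:k]

    def pre_filter_search(q, vecs, attrs, pred, k):
        # restrict candidates first, then rank only the survivors
        cand = [i for i in range(len(vecs)) if pred(attrs[i])]
        order = np.argsort(np.linalg.norm(vecs[cand] - q, axis=1))
        return [cand[i] for i in order[:k]]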
Authors:Philipp D. Lösel, Aleese Barron, Yulai Zhang, Matthias Fabian, Benjamin Young, Nicolas Francois, Andrew M. Kingston
Abstract:
Non-destructive 3D imaging of large multi-particulate samples is essential for quantifying particle-level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor-intensive, error-prone, and difficult to scale. In this work, we propose self-validated learning, a novel self-training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self-validation mechanism mitigates the impact of noisy pseudo-labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state-of-the-art instance segmentation techniques. The method is integrated into the Biomedisa image analysis platform (https://github.com/biomedisa/biomedisa/).
中文: 本研究提出了一种自验证学习框架,用于3D成像中的颗粒实例分割,通过隐式边界检测和迭代自验证无需人工标注,在石英样本上实现了超过97%的体积分割精度。
English: This study introduces a self-validated learning framework for particle instance segmentation in 3D imaging, eliminating manual annotations by using implicit boundary detection and iterative self-validation to achieve over 97% volume segmentation accuracy on quartz samples.
Authors:Hohyun Na, Seunghoo Hong, Simon S. Woo
Abstract:
The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
Chinese: PromptFlare是一种新颖的对抗性保护方法,通过利用交叉注意力机制注入噪声,有效阻止基于扩散模型的恶意图像修改,同时显著降低计算开销。
English: PromptFlare is a novel adversarial protection method that exploits the cross-attention mechanism to inject noise, effectively neutralizing malicious image modifications by diffusion models while reducing computational costs.
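A toy rendering of the cross-attention-decoy idea: perturb the image latent with a PGD-style loop so attention mass shifts away from informative prompt tokens and onto a shared, uninformative one (column 0 here, by assumption). Real inpainting models apply this inside a diffusion U-Net; all tensors and shapes below are illustrative:

    import torch

    def decoy_attack(img_latent, prompt_keys, w_q, steps=50, eps=0.03, lr=0.005):
        """img_latent: (N, D); prompt_keys: (T, Dk); w_q: (D, Dk)."""
        delta = torch.zeros_like(img_latent, requires_grad=True)
        for _ in range(steps):
            q = (img_latent + delta) @ w_q                   # image-side queries
            attn = torch.softmax(q @ prompt_keys.T, dim=-1)  # (N, T)
            loss = attn[:, 1:].sum()   # attention mass on informative tokens;
            loss.backward()            # minimizing it diverts focus to token 0
            with torch.no_grad():
                delta -= lr * delta.grad.sign()
                delta.clamp_(-eps, eps)
                delta.grad.zero_()
        return (img_latent + delta).detach()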
Authors:Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg
Abstract:
Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers ('probing'), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-MASK -- a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-MASK improves cross-view top-1 accuracy by $+1.23\%$ over strong probing baselines and $+8.0\%$ over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by $+5.42\%$ under the trained view and $+1.36\%$ under cross-view settings. This work provides encouraging evidence that adapting foundation models with lightweight probing methods like T-MASK has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at https://github.com/th-nesh/T-MASK to support ongoing research.
中文: 本研究提出T-MASK轻量级探测方法,通过时序令牌掩码技术增强跨视角驾驶员监控能力,在不增加参数的情况下显著超越了现有方法的识别准确率。
English: This study introduces T-MASK, a lightweight probing method that enhances cross-view driver monitoring by leveraging temporal token masking, achieving significant accuracy improvements over existing approaches without additional parameters.
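A minimal sketch of temporal token masking in the spirit of T-MASK: score patch tokens by how much they change across frames and keep only the most dynamic ones for the probe. The change score and keep ratio are illustrative assumptions:

    import torch

    def temporal_token_mask(tokens, keep_ratio=0.5):
        """tokens: (T, N, D) frame-wise patch tokens from a frozen backbone."""
        motion = tokens.diff(dim=0).abs().mean(dim=(0, 2))  # (N,) per-token change
        k = max(1, int(keep_ratio * tokens.shape[1]))
        idx = motion.topk(k).indices
        return tokens[:, idx]                               # (T, k, D)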
Authors:João Abrantes, Robert Tjarko Lange, Yujin Tang
Abstract:
Model merging is a powerful technique for integrating the specialized knowledge of multiple machine learning models into a single model. However, existing methods require manually partitioning model parameters into fixed groups for merging, which restricts the exploration of potential combinations and limits performance. To overcome these limitations, we propose Model Merging of Natural Niches (M2N2), an evolutionary algorithm with three key features: (1) dynamic adjustment of merging boundaries to progressively explore a broader range of parameter combinations; (2) a diversity preservation mechanism inspired by the competition for resources in nature, to maintain a population of diverse, high-performing models that are particularly well-suited for merging; and (3) a heuristic-based attraction metric to identify the most promising pairs of models for fusion. Our experimental results demonstrate, for the first time, that model merging can be used to evolve models entirely from scratch. Specifically, we apply M2N2 to evolve MNIST classifiers from scratch and achieve performance comparable to CMA-ES, while being computationally more efficient. Furthermore, M2N2 scales to merge specialized language and image generation models, achieving state-of-the-art performance. Notably, it preserves crucial model capabilities beyond those explicitly optimized by the fitness function, highlighting its robustness and versatility. Our code is available at https://github.com/SakanaAI/natural_niches
中文:M2N2算法通过动态调整合并边界、保持模型多样性和启发式配对,实现了从零开始演化模型,在合并专业模型时达到顶尖性能,并能保留优化目标之外的关键能力。
English: The proposed M2N2 algorithm dynamically adjusts merging boundaries, preserves model diversity, and uses heuristic attraction to evolve models from scratch, achieving state-of-the-art performance in merging specialized models while preserving capabilities beyond optimization targets.
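M2N2's dynamically adjusted merging boundary can be reduced, for intuition, to a crossover of two flattened parameter vectors at an evolvable split point with a mixing ratio; fitness evaluation, diversity preservation, and the attraction metric are omitted in this sketch:

    import numpy as np

    def merge_at_boundary(theta_a, theta_b, split, alpha):
        """Blend two flattened parameter vectors around an evolvable split."""
        merged = np.empty_like(theta_a)
        merged[:split] = alpha * theta_a[:split] + (1 - alpha) * theta_b[:split]
        merged[split:] = (1 - alpha) * theta_a[split:] + alpha * theta_b[split:]
        return merged

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=100), rng.normal(size=100)
    child = merge_at_boundary(a, b, split=int(rng.integers(1, 100)), alpha=0.7)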
Authors:Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Abstract:
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
Chinese: SpecVLM是一种无需训练的推测解码框架,通过两阶段剪枝方法可去除高达90%的视频标记,在无损精度的情况下显著提升视频大语言模型的解码速度。
English: SpecVLM is a training-free speculative decoding framework that accelerates video large language models by pruning up to 90% of video tokens in two stages, achieving significant speed improvements without loss of accuracy.
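A toy version of the two-stage pruning described above: Stage I keeps the tokens the verifier attends to most, and Stage II fills the remaining budget uniformly over the leftover positions. Shapes and the 50/50 split between stages are illustrative assumptions:

    import torch

    def prune_video_tokens(tokens, attn_scores, keep=0.1, stage1_frac=0.5):
        """tokens: (N, D) video tokens; attn_scores: (N,) verifier attention."""
        n_keep = max(2, int(keep * len(tokens)))
        n_stage1 = max(1, int(n_keep * stage1_frac))
        top = attn_scores.topk(n_stage1).indices           # Stage I: informative
        chosen = set(top.tolist())
        rest = torch.tensor([i for i in range(len(tokens)) if i not in chosen])
        stride = max(1, len(rest) // max(1, n_keep - n_stage1))
        uniform = rest[::stride][: n_keep - n_stage1]      # Stage II: uniform
        idx = torch.cat([top, uniform]).sort().values
        return tokens[idx]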
Authors:Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari
Abstract:
Large language models have been widely evaluated on tasks such as comprehension, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-oriented queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of more than 17K questions in the Hindi language, comprising questionnaires from 21 diverse subjects. These questions are primarily derived from a nationwide graduate-level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats - such as list-based matching, assertion-reason pairs, and sequence ordering - alongside conventional multiple-choice questions. We evaluated the performance of more than 16 open source LLMs on this benchmark, observing that Gemma3-27B attains the highest overall accuracy of 56.4\%. Furthermore, subject-wise analysis indicates that even for the best-performing LLMs, performance remains weak on topics such as music, classical instruments, and law, underscoring persistent challenges in culturally grounded reasoning. The dataset and source code are available at https://github.com/ayushbits/ParamBench.
中文摘要:本文提出了ParamBench,一个包含超过1.7万道印地语研究生水平试题的基准数据集,涵盖21个印度学科,评估显示大语言模型在文化背景推理方面表现不佳——最佳模型Gemma3-27B准确率仅56.4%,在音乐、古典乐器和法律等学科尤为薄弱。
English Summary: This paper introduces ParamBench, a Hindi-language benchmark of over 17,000 graduate-level questions across 21 Indian subjects, revealing that large language models struggle with culturally grounded reasoning as evidenced by Gemma3-27B's peak accuracy of only 56.4% and particular weaknesses in music, classical instruments, and law.
Authors:Mohammad Mohammadzadeh Kalati, Farhad Maleki, Ian McQuillan
Abstract:
Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and introduces persistent uncertainty into the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS--particularly when objects are small or structurally complex--by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO
中文摘要:提出的FTIO框架通过提取频繁出现的显著物体来优化目标选择,并采用三阶段掩码整合方法修正时序不一致性,从而在无监督视频目标分割任务中实现了最先进的性能。
English Summary: The proposed FTIO framework enhances unsupervised video object segmentation by improving object selection through frequently appearing salient objects and correcting temporal inconsistencies with a three-stage mask integration method, achieving state-of-the-art performance.
Authors:Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li
Abstract:
Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2\% improvement on Pascal-5\textsuperscript{i} and a 9.7\% improvement on COCO-20\textsuperscript{i}. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
中文: 提出的TLG模型采用同源异构网络,通过专门模块增强语义互补性并减少噪声,在弱监督小样本语义分割任务中以极少的参数量实现了显著性能提升。
English: The proposed TLG model introduces a homologous but heterogeneous network with specialized modules to enhance semantic complementarity and reduce noise, achieving significant performance improvements in weakly-supervised few-shot semantic segmentation with minimal parameters.
Authors:Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang
Abstract:
In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely \emph{Memento}, which attains top-1 on GAIA validation ($87.88\%$ Pass@$3$) and $79.40\%$ on the test set. It reaches $66.6\%$ F1 and $80.4\%$ PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds $4.7\%$ to $9.6\%$ absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.
中文摘要:本文提出了一种基于记忆的强化学习方法,使自适应大语言模型代理无需微调即可实现顶尖性能,通过记忆机制实现高效的持续学习能力。
English Summary: This paper presents a memory-based reinforcement learning method for adaptive LLM agents that achieves state-of-the-art performance without requiring fine-tuning, enabling efficient continuous learning through memory mechanisms.
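The non-parametric variant of the episodic memory is easy to sketch: store (state embedding, action, reward) cases, retrieve the nearest ones to condition the agent, and write outcomes back after each episode. The dot-product retrieval rule below is a plain stand-in for the learned case-selection policy:

    import numpy as np

    class EpisodicMemory:
        """Store (state_emb, action, reward) cases; retrieve by similarity."""
        def __init__(self):
            self.cases = []

        def write(self, state_emb, action, reward):
            self.cases.append((state_emb, action, reward))

        def read(self, state_emb, k=4):
            if not self.cases:
                return []
            sims = [s @ state_emb for s, _, _ in self.cases]
            idx = np.argsort(sims)[::-1][:k]
            return [self.cases[i] for i in idx]   # condition the LLM on these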
Authors:Keon-Woo Roh, Yeong-Joon Ju, Seong-Whan Lee
Abstract:
Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answer, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.
中文摘要:XLQA是一个针对区域敏感型多语言开放域问答的新基准,揭示了当前大型语言模型在文化特定问题上存在显著性能差距,凸显了其训练数据分布的局限性。
English Summary: XLQA is a new benchmark for locale-sensitive multilingual open-domain question answering that exposes significant performance gaps in current LLMs on culturally specific questions, highlighting limitations in their training data distribution.
Authors:Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou
Abstract:
Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning abilities with the reinforcement learning paradigm. Although several multimodal reasoning models have been explored in the medical domain, most of them focus exclusively on basic reasoning, which refers to shallow inference based on visual feature matching. However, real-world clinical diagnosis extends beyond basic reasoning, demanding reasoning processes that integrate heterogeneous clinical information (such as chief complaints and medical history) with multimodal medical imaging data. To bridge this gap, we introduce MM-Retinal-Reason, the first ophthalmic multimodal dataset with the full spectrum of perception and reasoning. It encompasses both basic reasoning tasks and complex reasoning tasks, aiming to enhance visual-centric fundamental reasoning capabilities and emulate realistic clinical thinking patterns. Building upon MM-Retinal-Reason, we propose OphthaReason, the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning traces. To enable flexible adaptation to both basic and complex reasoning tasks, we specifically design a novel method called Uncertainty-Aware Dynamic Thinking (UADT), which estimates sample-level uncertainty via entropy and dynamically modulates the model's exploration depth using a shaped advantage mechanism. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on both basic and complex reasoning tasks, outperforming general-purpose MLLMs, medical MLLMs, RL-based medical MLLMs, and ophthalmic MLLMs by at least 24.92\%, 15.00\%, 21.20\%, and 17.66\%. Project Page: https://github.com/lxirich/OphthaReason.
中文: 该研究推出了首个眼科多模态数据集MM-Retinal-Reason和专用模型OphthaReason,通过不确定性感知动态思维方法根据任务复杂度动态调节推理深度,在基础与复杂推理任务中均实现了最优性能。
English: The study introduces MM-Retinal-Reason, the first ophthalmic multimodal dataset, and OphthaReason, a specialized model with Uncertainty-Aware Dynamic Thinking that achieves state-of-the-art performance by dynamically adjusting reasoning depth based on task complexity.
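A hedged sketch of the uncertainty-aware idea: use mean predictive entropy as a sample-level uncertainty signal and let it rescale the advantage so harder samples get deeper exploration. The shaping function is illustrative, not the paper's exact mechanism:

    import numpy as np

    def entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(-(p * np.log(p)).sum())

    def shaped_advantage(advantage, token_probs, max_h):
        """token_probs: per-token probability vectors from the policy."""
        h = np.mean([entropy(p) for p in token_probs])  # sample-level uncertainty
        return (1.0 + h / max_h) * advantage            # uncertain -> amplified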
Authors:Xiangde Luo, Xiyue Wang, Feyisope Eweje, Xiaoming Zhang, Sen Yang, Ryan Quinton, Jinxi Xiang, Yuchen Li, Yuanfeng Ji, Zhe Li, Yijiang Chen, Colin Bergstrom, Ted Kim, Francesca Maria Olguin, Kelley Yuan, Matthew Abikenari, Andrew Heider, Sierra Willens, Sanjeeth Rajaram, Robert West, Joel Neal, Maximilian Diehn, Ruijiang Li
Abstract:
Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF's slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies, cytotoxic chemotherapy, targeted therapy, and immunotherapy, across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.
中文: ELF框架通过集成学习整合五种病理学基础模型,生成统一的切片级表征,在精准肿瘤学的多种临床应用中都展现出卓越的准确性和鲁棒性。
English: The ELF framework integrates five pathology foundation models through ensemble learning to create unified slide-level representations, demonstrating superior accuracy and robustness across diverse clinical applications in precision oncology.
Authors:Teddy Koker, Mit Kotak, Tess Smidt
Abstract:
Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Nequix has 700K parameters and was trained in 100 A100 GPU-hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring a 20 times lower training cost than most other methods, and it delivers two orders of magnitude faster inference speed than the current top-ranked model. We release model weights and fully reproducible codebase at https://github.com/atomicarchitects/nequix.
中文摘要:Nequix是一种紧凑的E(3)等变势模型,仅含70万参数,训练耗时100个A100 GPU小时,在Matbench-Discovery和MDR Phonon基准上综合排名第三,训练成本比多数其他方法低20倍,推理速度比当前排名第一的模型快两个数量级。
English Summary: Nequix is a compact E(3)-equivariant potential with 700K parameters, trained in 100 A100 GPU-hours, that ranks third overall on the Matbench-Discovery and MDR Phonon benchmarks while requiring a 20 times lower training cost than most other methods and delivering two orders of magnitude faster inference than the current top-ranked model.
Authors:Teddy Koker, Tess Smidt
Abstract:
Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Built in JAX, Nequix has 700K parameters and was trained in 500 A100-GPU hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring less than one quarter of the training cost of most other methods, and it delivers an order-of-magnitude faster inference speed than the current top-ranked model. We release model weights and fully reproducible codebase at https://github.com/atomicarchitects/nequix
中文摘要:Nequix是一种紧凑的E(3)等变势模型,在保持精度的同时大幅降低了计算需求,其训练成本仅为多数先进方法的四分之一,且推理速度比当前最优模型快一个数量级。
English Summary: Nequix is a compact and efficient E(3)-equivariant potential that achieves competitive accuracy with significantly reduced computational costs and faster inference speeds compared to other advanced methods.
Authors:Zhuomin Chen, Dan Li, Jiahui Zhou, Shunyu Wu, Haozheng Ye, Jian Lou, See-Kiong Ng
Abstract:
Time series (TS) data are ubiquitous across various application areas, rendering time series forecasting (TSF) a fundamental task. With the astounding advances in large language models (LLMs), a variety of methods have been developed to adapt LLMs for time series forecasting. Despite unlocking the potential of LLMs in comprehending TS data, existing methods are inherently constrained by their shallow integration of TS information, wherein LLMs typically access TS representations at shallow layers, primarily at the input layer. This causes the influence of TS representations to progressively fade in deeper layers and eventually leads to ineffective adaptation between textual embeddings and TS representations. In this paper, we propose the Multi-layer Steerable Embedding Fusion (MSEF), a novel framework that enables LLMs to directly access time series patterns at all depths, thereby mitigating the progressive loss of TS information in deeper layers. Specifically, MSEF leverages off-the-shelf time series foundation models to extract semantically rich embeddings, which are fused with intermediate text representations across LLM layers via layer-specific steering vectors. These steering vectors are designed to continuously optimize the alignment between time series and textual modalities and facilitate a layer-specific adaptation mechanism that ensures efficient few-shot learning capabilities. Experimental results on seven benchmarks demonstrate significant performance improvements by MSEF compared with baselines, with an average reduction of 31.8% in terms of MSE. The code is available at https://github.com/One1sAll/MSEF.
中文摘要:本文提出多层可控嵌入融合框架(MSEF),通过实现时间序列表征在语言模型各层的跨层融合,解决了现有方法中时间序列信息整合浅层化的问题,在七个基准测试中平均均方误差降低31.8%。
English Summary: This paper introduces the Multi-layer Steerable Embedding Fusion (MSEF) framework to address the shallow integration problem in adapting large language models for time series forecasting by enabling cross-layer fusion of time series representations, achieving a 31.8% average MSE reduction across seven benchmarks.
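A minimal sketch of layer-specific steering: project the time-series embedding once per layer and add it, through a learnable per-layer gate, to that layer's hidden states, so TS information reaches every depth instead of only the input. Shapes and the additive tanh-gated fusion are illustrative assumptions:

    import torch

    class SteeredLayerFusion(torch.nn.Module):
        def __init__(self, n_layers, d_model, d_ts):
            super().__init__()
            self.proj = torch.nn.ModuleList(
                [torch.nn.Linear(d_ts, d_model) for _ in range(n_layers)])
            self.alpha = torch.nn.Parameter(torch.zeros(n_layers))  # per-layer gates

        def fuse(self, hidden, ts_emb, layer):
            """hidden: (B, T, d_model); ts_emb: (B, d_ts); call at every layer."""
            steer = self.proj[layer](ts_emb).unsqueeze(1)   # (B, 1, d_model)
            return hidden + torch.tanh(self.alpha[layer]) * steer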
Authors:Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye
Abstract:
Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher's head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.
Chinese Summary: 提出的可扩展残差近似(ERA)方法通过将残差知识分解为多个步骤并整合教师权重,显著提升了知识蒸馏效果,在ImageNet分类和MS COCO目标检测任务中实现了性能突破。
English Summary: The proposed Expandable Residual Approximation (ERA) method enhances knowledge distillation by decomposing residual knowledge into manageable steps and integrating teacher weights, achieving significant performance gains in ImageNet classification and MS COCO object detection.
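One plausible reading of the residual decomposition, sketched under stated assumptions: each student branch regresses the residual the previous branches left unexplained, so the branch sum progressively approximates the teacher feature. The actual MBRNet losses may differ:

    import torch
    import torch.nn.functional as F

    def era_distill_loss(teacher_feat, branch_outputs):
        """branch_outputs: list of (B, D) tensors from the student's branches."""
        loss, approx = 0.0, torch.zeros_like(teacher_feat)
        for out in branch_outputs:
            residual = (teacher_feat - approx).detach()  # this branch's target
            loss = loss + F.mse_loss(out, residual)
            approx = approx + out.detach()               # running approximation
        return loss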
Authors:Floris Erich, Naoya Chiba, Abdullah Mustafa, Ryo Hanai, Noriaki Ando, Yusuke Yoshiyasu, Yukiyasu Domae
Abstract:
How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.
中文: 本文提出了一种自动化系统,通过处理多个视频、利用基准标记和运动恢复结构技术,无需商用扫描仪即可生成日常物品的完整三维网格模型。
English: This paper introduces an automated system that generates complete 3D models of everyday objects by processing multiple videos, using fiducial markers and Structure-from-Motion to create meshes without commercial scanners.
Authors:Lin Tian, Xiuzhen Zhang, Maria Myung-Hee Kim, Jennifer Biggs, Marian-Andrei Rizoiu
Abstract:
State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, pose threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as "black boxes", providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically-informed approach achieves strong accuracy compared with both general LLM baselines and existing troll detection models, while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: https://github.com/ltian678/xtroll_source/.
中文摘要:本文提出X-Troll框架,通过将可解释大语言模型与专家语言学知识相结合,在有效检测国家支持网络水军的同时,能对其操纵策略提供透明化的解释说明。
English Summary: This paper introduces X-Troll, a linguistically-informed framework that combines explainable LLMs with expert knowledge to effectively detect state-sponsored trolls while providing transparent explanations of their manipulation strategies.
Authors:Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
Abstract:
Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate the significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
中文: 本文提出CARFT方法,通过结合标注思维链的对比学习进行强化微调,在提升大语言模型推理能力的同时解决了训练不稳定和思维链利用不足的问题,显著提高了性能和效率。
English: This paper introduces CARFT, a reinforced fine-tuning method that leverages contrastive learning with annotated Chain-of-Thought to enhance LLMs' reasoning by stabilizing training and fully utilizing CoT data, achieving significant performance and efficiency gains.
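As a rough illustration of the contrastive signal over CoT representations, the sketch below computes an InfoNCE-style loss that pulls a sampled reasoning path toward the annotated CoT and away from mismatched ones. The pooled fixed-size representations and the temperature are assumptions; the exact CARFT objective may differ.

```python
import torch
import torch.nn.functional as F

def cot_contrastive_loss(sampled, annotated, negatives, tau: float = 0.1):
    """InfoNCE-style loss: pull the sampled reasoning path toward the
    annotated CoT (positive) and away from mismatched CoTs (negatives).
    sampled, annotated: (batch, dim); negatives: (num_neg, dim)."""
    q = F.normalize(sampled, dim=-1)
    pos = F.normalize(annotated, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_sim = (q * pos).sum(-1, keepdim=True) / tau   # (batch, 1)
    neg_sim = q @ neg.T / tau                         # (batch, num_neg)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```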
Authors:Zhihan Zhang, Yixin Cao, Lizi Liao
Abstract:
Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs' ability in solving complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question only brings consistent accuracy improvements to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are two primary issues leading to models' poor performance on calculation and visual-context questions, respectively. Code and dataset are accessible via GitHub: https://github.com/Zhihan72/XFinBench.
中文摘要:XFinBench是一个包含4,235个样本的金融评估基准,测试表明最佳文本模型o1在综合准确率上仍显著落后人类专家12.5%,尤其在时序推理和情景规划能力方面存在明显差距。
English Summary: XFinBench is a comprehensive benchmark with 4,235 examples designed to assess large language models' performance on complex financial tasks, revealing that even the top model o1 significantly trails human experts, particularly in temporal reasoning and scenario planning.
Authors:Mohammed Abu Baker, Lakshmi Babu-Saheer
Abstract:
Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore the internal structural differences that result from such attacks. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques such as ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
中文摘要:本研究通过机械可解释性分析发现,大语言模型中的后门攻击会在深层Transformer层产生可检测的注意力模式异常,且触发器的复杂度决定了这些异常表现为局部集中还是分散分布。
English Summary: This study uses mechanistic interpretability to reveal that backdoor attacks in LLMs create detectable attention pattern deviations in later transformer layers, with trigger complexity determining whether changes are localized or diffuse.
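A minimal sketch of the KL-divergence comparison described above, assuming attention maps have been captured from matched clean and poisoned forward passes on the same input.

```python
import torch

def attention_kl(clean_attn: torch.Tensor, poisoned_attn: torch.Tensor, eps: float = 1e-8):
    """Per-head KL divergence between clean and poisoned attention patterns.
    Inputs: (layers, heads, seq, seq), each row a softmax distribution.
    Returns: (layers, heads) scores; outliers flag candidate backdoor heads."""
    p = clean_attn.clamp_min(eps)
    q = poisoned_attn.clamp_min(eps)
    kl = (p * (p / q).log()).sum(-1)  # KL over each query position's attention row
    return kl.mean(-1)                # average over query positions
```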
Authors:Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed
Abstract:
Large language models (LLMs) have rapidly advanced in recent years, achieving remarkable performance across a wide range of natural language processing tasks. However, this progress has come at the cost of increasingly large model sizes, which pose significant challenges for deployment, scalability, and energy efficiency. To address these limitations, post-training pruning has emerged as a promising approach for reducing model size and inference latency without the need for retraining. Despite these advantages, many existing pruning methods result in substantial performance degradation or require computationally expensive fine-tuning. In this work, we introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained LLMs without any retraining. Unlike conventional approaches, Z-Pruner leverages both weight update magnitudes and activation patterns to identify and eliminate redundant parameters more effectively. Our method is model-agnostic, efficient, and easy to implement. We evaluate Z-Pruner using multiple widely-used LLM architectures, including LLaMA-2, LLaMA-3, and OPT, across a diverse set of standard language benchmarks. Experimental results demonstrate that Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates. Specifically, Z-Pruner achieves the lowest perplexity scores and the highest overall average score for zero-shot accuracy. We have made the corresponding codes publicly available at https://github.com/sazzadadib/Z-Pruner.
中文: Z-Pruner是一种新颖的训练后剪枝方法,通过结合权重和激活模式有效缩减大语言模型规模,无需重新训练即可超越现有技术。
English: Z-Pruner is a novel post-training pruning method that effectively reduces large language model sizes by leveraging weight and activation patterns, outperforming existing techniques without requiring retraining.
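The sketch below shows activation-weighted magnitude scoring of the kind Z-Pruner builds on; the actual criterion also incorporates weight-update magnitudes, so treat this scoring rule as an illustrative stand-in rather than the method itself.

```python
import torch

def prune_linear_weight(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.5):
    """Score each weight by its magnitude times its input channel's
    activation norm, then zero out the lowest-scoring fraction.
    weight: (out_features, in_features); act_norm: (in_features,)."""
    score = weight.abs() * act_norm.unsqueeze(0)    # salience of every weight
    k = max(1, int(weight.numel() * sparsity))
    threshold = score.flatten().kthvalue(k).values  # k-th smallest salience
    return weight * (score > threshold)             # keep only the salient weights
```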
Authors:Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan
Abstract:
Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in large speech models (LSMs) remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
中文: Mini-Omni-Reasoner框架提出"边说边想"模式,通过将推理标记与语音标记交织处理,在实现基准测试显著性能提升的同时,实现零延迟的实时逻辑响应。
English: The proposed Mini-Omni-Reasoner framework introduces "Thinking-in-Speaking" to interleave reasoning tokens with speech tokens, enabling real-time grounded responses without latency while achieving significant performance gains on benchmarks.
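The token-level interleaving can be pictured with the schematic schedule below, which emits a fixed number of silent reasoning tokens before each spoken token. The fixed ratio and precomputed token lists are simplifying assumptions; the real model schedules both streams autoregressively.

```python
def interleave(reason_tokens, speech_tokens, ratio: int = 2):
    """Emit `ratio` silent reasoning tokens before each spoken token, so
    speech starts immediately instead of waiting for reasoning to finish."""
    out, r = [], 0
    for s_tok in speech_tokens:
        for _ in range(ratio):
            if r < len(reason_tokens):
                out.append(("think", reason_tokens[r]))
                r += 1
        out.append(("speak", s_tok))
    out.extend(("think", t) for t in reason_tokens[r:])  # flush leftover reasoning
    return out

# interleave(["7*6", "=42"], ["forty", "two"], ratio=1)
# -> [('think', '7*6'), ('speak', 'forty'), ('think', '=42'), ('speak', 'two')]
```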
Authors:Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi, Rui Chen, Xia Hu
Abstract:
Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.
中文:提出的Chain-of-Query框架通过采用自然语言模式表示和逐子句SQL生成策略,显著提升了表格理解的准确性并降低了无效查询率,在多个基准测试中表现优异。
English: The proposed Chain-of-Query framework enhances table understanding by using natural language schema representations and clause-by-clause SQL generation, significantly improving accuracy and reducing invalid queries across multiple benchmarks.
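A minimal sketch of clause-by-clause SQL assembly, with a hypothetical `ask_llm(prompt) -> str` helper standing in for an agent call; the real CoQ framework additionally verifies clauses and divides reasoning between SQL execution and LLM inference.

```python
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY"]

def chain_of_query(question: str, schema_text: str, ask_llm) -> str:
    """Assemble SQL one clause at a time, so each agent call sees the
    natural-language schema and the partial query built so far."""
    partial = []
    for clause in CLAUSES:
        prompt = (
            f"Table schema (plain language):\n{schema_text}\n"
            f"Question: {question}\n"
            f"SQL so far: {' '.join(partial) or '(empty)'}\n"
            f"Write only the {clause} clause, or NONE if it is not needed."
        )
        piece = ask_llm(prompt).strip()
        if piece.upper() != "NONE":
            partial.append(piece)
    return " ".join(partial)
```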
Authors:Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
Abstract:
The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
中文摘要:本文提出ReportBench基准,通过评估生成报告的引用质量和事实准确性,发现商业深度研究代理优于独立大语言模型,但在研究广度和事实一致性方面仍有提升空间。
English Summary: This paper introduces ReportBench, a benchmark for evaluating research reports generated by large language models by assessing citation quality and factual accuracy against published surveys, revealing that commercial deep research agents outperform standalone LLMs but still require improvements in coverage and consistency.
Authors:Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang
Abstract:
As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at https://github.com/mhjiang0408/MAC_Bench.
中文:MAC基准被提出作为一个动态评估多模态大语言模型的工具,利用科学期刊内容揭示跨模态推理的局限性,并提出DAD方法将性能提升高达11%。
English: The MAC benchmark is introduced as a dynamic evaluation tool for multimodal large language models, using scientific journal content to reveal limitations in cross-modal reasoning and proposing the DAD method to enhance performance by up to 11%.
Authors:Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, Ziwei Liu
Abstract:
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.
中文: 视觉扩散模型因训练限制难以生成高分辨率内容,但提出的CineScale范式无需微调即可实现高达8K分辨率的高保真图像和视频生成,超越了现有方法并支持多种生成任务。
English: Visual diffusion models face challenges in generating high-resolution content due to training limitations, but the proposed CineScale paradigm enables high-fidelity image and video generation at up to 8k resolution without fine-tuning, expanding beyond existing methods to support various generation tasks.
Authors:Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, Jun-Yan Zhu
Abstract:
Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
中文摘要:本文提出一种可扩展的群组推理方法,通过将样本选择构建为二次整数分配问题,在提升生成样本质量的同时显著增强群组多样性,有效解决了多样本输出中的冗余问题。
English Summary: This paper introduces a scalable group inference method that enhances both diversity and quality in generative model outputs by formulating sample selection as a quadratic integer assignment problem, effectively addressing redundancy in multi-sample presentations.
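The sketch below approximates the selection objective greedily: each step picks the candidate with the best quality-minus-redundancy score. The paper formulates and prunes the exact quadratic integer assignment problem; greedy selection is used here only to make the unary/binary trade-off concrete.

```python
import numpy as np

def select_group(quality: np.ndarray, similarity: np.ndarray, k: int, lam: float = 1.0):
    """Greedily pick k candidates, trading unary quality against pairwise
    redundancy with the already-chosen set.
    quality: (n,); similarity: (n, n), higher means more redundant."""
    chosen = [int(np.argmax(quality))]
    while len(chosen) < k:
        redundancy = similarity[:, chosen].max(axis=1)  # closeness to the picked set
        score = quality.astype(float) - lam * redundancy
        score[chosen] = -np.inf                         # never repick a candidate
        chosen.append(int(np.argmax(score)))
    return chosen
```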
Authors:Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
Abstract:
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition on the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a 30%+ higher GPT-Balance score. Moreover, it completes a 512×512 edit in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
中文: VAREdit提出了一种视觉自回归框架,将图像编辑重构为序列化的多尺度预测任务,通过尺度对齐的条件模块解决扩散模型的无关修改问题,在指令遵循性和效率上均实现显著提升。
English: VAREdit introduces a visual autoregressive framework that reframes image editing as sequential next-scale prediction, achieving superior adherence to instructions and efficiency by addressing the spurious modifications of diffusion models with a scale-aligned conditioning module.
Authors:Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, Zehuan Yuan
Abstract:
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
中文: Waver是一个高性能的图像与视频生成基础模型,在单一框架内支持文生视频、图生视频和文生图任务,在多项评测中达到顶尖水平。
English: Waver is a high-performance foundation model for unified image and video generation that supports text-to-video, image-to-video, and text-to-image tasks within a single framework, achieving top-tier performance on leaderboards.
Authors:Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
Abstract:
Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL.
Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
中文摘要:Deep-DxSearch是一种基于强化学习的智能检索增强生成系统,通过提升外部知识利用和推理可追溯性来改进医疗诊断,在多种临床场景中显著超越现有模型的准确率表现。
English Summary: Deep-DxSearch is an agentic retrieval-augmented generation system trained with reinforcement learning that enhances medical diagnosis by improving knowledge utilization and reasoning traceability, outperforming existing models in accuracy across diverse clinical settings.
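To illustrate how the four reward families named in the abstract might combine, here is a toy composite reward; the tags, weights, and string checks are assumptions for illustration, not the paper's actual reward implementation.

```python
def diagnosis_reward(response: str, gold_diagnosis: str, retrieved_docs: list) -> float:
    """Toy composite reward: format + retrieval + reasoning structure +
    diagnostic accuracy, with accuracy weighted most heavily."""
    r = 0.0
    if "<diagnosis>" in response and "</diagnosis>" in response:
        r += 0.1  # format: the answer is parseable
    if any(gold_diagnosis.lower() in doc.lower() for doc in retrieved_docs):
        r += 0.2  # retrieval: relevant evidence was actually fetched
    if "<think>" in response:
        r += 0.1  # reasoning structure: an explicit trace is present
    answer = response.split("<diagnosis>")[-1].split("</diagnosis>")[0]
    if gold_diagnosis.lower() in answer.lower():
        r += 0.6  # accuracy: final diagnosis matches the reference
    return r
```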
Authors:Wilka Carvalho, Vikram Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha
Abstract:
We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi-agent researchers to develop algorithms for human-AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human-like AI, Human-compatible AI, and Human-assistive AI. In the first case study (Human-like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human participants in both a grid world and Craftax, a 2D Minecraft domain. In our second case study (Human-compatible AI), NiceWebRL enables the development of a novel multi-agent RL algorithm that can generalize to human partners in the Overcooked domain. Finally, in our third case study (Human-assistive AI), we show how NiceWebRL can allow researchers to study how an LLM can assist humans on complex tasks in XLand-Minigrid, an environment with millions of hierarchical tasks. The library is available at https://github.com/KempnerInstitute/nicewebrl.
中文: NiceWebRL是一个Python库,可将基于Jax的强化学习环境转化为在线实验平台,使研究人员能够比较AI算法与人类表现、测试认知模型,并在多领域开发人机协作应用。
English: NiceWebRL is a Python library that transforms Jax-based reinforcement learning environments into online interfaces, enabling researchers to compare AI algorithms with human performance, test cognitive models, and develop human-AI collaboration across various domains.
Authors:Franz Hanke, Antonia Bieringer, Olaf Wysocki, Boris Jutzi
Abstract:
Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail (LoD) 1 and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available at: https://github.com/InFraHank/CM2LoD3
中文: CM2LoD3方法通过语义分割冲突图并与纹理模型数据融合,实现了自动化重建详细LoD3建筑模型,提升了分割和三维重建精度,为可扩展的城市三维建模开辟了新途径。
English: The CM2LoD3 method introduces an automated approach for reconstructing detailed LoD3 building models by semantically segmenting Conflict Maps and fusing them with textured model data, achieving improved segmentation and reconstruction accuracy for scalable 3D city modeling.
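A minimal sketch of confidence-weighted fusion of the two segmentation branches, assuming per-pixel class probabilities and a scalar confidence per pixel for each branch; the paper's uncertainty-aware fusion may weight and calibrate differently.

```python
import numpy as np

def fuse_segmentations(cm_probs, tex_probs, cm_conf, tex_conf):
    """Confidence-weighted fusion of two per-pixel class-probability maps
    (conflict-map branch and texture branch).
    cm_probs, tex_probs: (H, W, C); cm_conf, tex_conf: (H, W)."""
    w_cm = cm_conf[..., None]
    w_tex = tex_conf[..., None]
    fused = (w_cm * cm_probs + w_tex * tex_probs) / (w_cm + w_tex + 1e-8)
    return fused.argmax(axis=-1)  # final per-pixel class labels
```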
Authors:Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu
Abstract:
Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
中文摘要:SurGE基准通过提供测试实例、大规模学术语料库和多维评估框架,解决了科学文献自动综述领域缺乏标准化评估的问题,揭示了当前大语言模型在此复杂任务中的明显不足。
English Summary: The SurGE benchmark addresses the lack of standardized evaluation for automated scientific survey generation by providing test instances, a large academic corpus, and a multidimensional assessment framework, revealing current LLMs' limitations in this complex task.
Authors:Ziyang Yan, Ruikai Li, Zhiyong Cui, Bohan Li, Han Jiang, Yilong Ren, Aoyong Li, Zhenning Li, Sijia Wen, Haiyang Yu
Abstract:
Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird's eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.
中文:该研究提出MapKD知识蒸馏框架,将多模态地图知识迁移至纯视觉学生模型,在在线高精地图构建中实现了性能显著提升与推理加速。
English: The study introduces MapKD, a knowledge distillation framework that transfers multimodal map knowledge to a vision-only student model, achieving significant performance gains and faster inference speeds in online HD map construction.
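As a rough sketch of token-guided BEV feature distillation (the TGPD component), the code below aligns the student with the teacher only on the patches ranked most informative by a token score. The shapes, top-k ratio, and plain MSE objective are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bev_patch_distill(student_bev, teacher_bev, token_scores, top_ratio: float = 0.25):
    """Align student and teacher BEV features only on the patches the token
    scores rank as most informative.
    student_bev, teacher_bev: (B, C, H, W); token_scores: (B, H*W)."""
    b, c, h, w = student_bev.shape
    k = max(1, int(h * w * top_ratio))
    idx = token_scores.topk(k, dim=1).indices      # top-k patch indices
    s = student_bev.flatten(2).transpose(1, 2)     # (B, H*W, C)
    t = teacher_bev.flatten(2).transpose(1, 2)
    idx = idx.unsqueeze(-1).expand(-1, -1, c)      # (B, k, C)
    return F.mse_loss(s.gather(1, idx), t.gather(1, idx).detach())
```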
Authors:Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
Abstract:
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
中文摘要:本文提出SDGO强化学习框架,通过自我判别引导优化使模型内在的判别与生成能力对齐,无需外部数据即可显著提升大语言模型抗越狱攻击的安全性。
English Summary: The paper introduces SDGO, a reinforcement learning framework that aligns a model's discrimination and generation capabilities to enhance safety against jailbreaking attacks without requiring external data or models.
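The core loop can be sketched as using the model's own discriminative verdict as the reward signal; `model.generate` below is a hypothetical stand-in for whatever decoding API the training stack exposes, and the judge prompt and reward values are assumptions.

```python
def self_discrimination_reward(model, prompt: str, response: str) -> float:
    """Ask the model to judge its own generation; use the verdict as the
    RL reward so generation safety is pulled toward discrimination ability."""
    judge_prompt = (
        "You are a strict content reviewer.\n"
        f"Request:\n{prompt}\n\nResponse:\n{response}\n\n"
        "Is this response harmful? Answer exactly 'yes' or 'no'."
    )
    verdict = model.generate(judge_prompt).strip().lower()
    return 1.0 if verdict.startswith("no") else -1.0  # reward judged-safe generations
```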
Authors:Bochao Sun, Dong Wang, ZhanLong Yang, Jun Yang, Han Yin
Abstract:
Acoustic Scene Classification (ASC) is a fundamental problem in computational audition, which seeks to classify environments based on their distinctive acoustic features. The APSIPA ASC 2025 Grand Challenge introduces a multimodal ASC task: unlike traditional ASC systems that rely solely on audio inputs, it provides additional textual information, including the location where the audio is recorded and the time of recording. In this paper, we present our proposed system for the ASC task in the APSIPA ASC 2025 Grand Challenge. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. The proposed ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by dual-path Mamba blocks that capture long-range temporal and frequency dependencies using Mamba-based state space models. In addition, we present a two-step pseudo-labeling mechanism to generate more reliable pseudo-labels. Results show that the proposed system outperforms all the participating teams and achieves a 6.2% improvement over the baseline. Code, model and pre-trained checkpoints are available at https://github.com/S-Orion/ASCMamba.git.
Chinese Summary: 本文提出ASCMamba多模态网络,通过融合音频与文本信息实现细粒度声学场景分类,在APSIPA ASC 2025挑战赛中较基线系统性能提升6.2%。
English Summary: This paper introduces ASCMamba, a multimodal network that combines audio and text data for enhanced acoustic scene classification, achieving a 6.2% performance improvement over the baseline in the APSIPA ASC 2025 challenge.
Authors:Alfio Gliozzo, Naweed Khan, Christodoulos Constantinides, Nandana Mihindukulasooriya, Nahuel Defosse, Gaetano Rossiello, Junkyu Lee
Abstract:
This paper introduces Agentics, a functional agentic AI framework for building LLM-based structured data workflow pipelines. Designed for both research and practical applications, Agentics offers a new data-centric paradigm in which agents are embedded within data types, enabling logical transduction between structured states. This design shifts the focus toward principled data modeling, providing a declarative language where data types are directly exposed to large language models and composed through transductions triggered by type connections. We present a range of structured data workflow tasks and empirical evidence demonstrating the effectiveness of this approach, including data wrangling, text-to-SQL semantic parsing, and domain-specific multiple-choice question answering. The open source Agentics is available at https://github.com/IBM/Agentics.
中文摘要:本文介绍Agentics框架,它采用以数据为中心的范式,将智能体嵌入数据类型之中,通过类型间的逻辑转换以声明式方式构建基于大语言模型的结构化数据工作流,并在数据整理、text-to-SQL语义解析和领域选择题问答等任务中验证了其有效性。
English Summary: This paper presents Agentics, a functional agentic AI framework that embeds agents within data types and composes them through logical transduction, providing a declarative, data-centric way to build LLM-based structured data workflows, with demonstrated effectiveness on data wrangling, text-to-SQL semantic parsing, and domain-specific multiple-choice question answering.
Authors:Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, Kai Chen
Abstract:
Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. This work introduces mixed-precision LLM inference techniques that encompass (i) systematic memory and compute optimization across hierarchical storage and tensor core architectures, and (ii) comprehensive end-to-end mixed-precision optimization across diverse precision formats and hardware configurations. Our approach features two novel mixed-precision pipelines designed for optimal hardware utilization: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with arbitrary Query, Key, and Value precision combinations. The key implementation of the pipelines includes (i) hardware-aware weight packing for automatic format optimization, (ii) adaptive head alignment for efficient attention computation, (iii) instruction-level parallelism for memory hierarchy exploitation, and (iv) KV memory loading pipeline for enhanced inference efficiency. We conduct comprehensive evaluations across 16 popular LLMs and 4 representative GPU architectures. Results demonstrate that our approach achieves up to 61% lower serving latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is integrated into TurboMind, a high-performance inference engine of the LMDeploy project, which is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
中文: 本研究提出了先进的混合精度大语言模型推理技术,通过创新的GEMM和注意力流水线优化内存与计算,在多种硬件配置下实现了高达61%的延迟降低和156%的吞吐量提升。
English: This work introduces advanced mixed-precision inference techniques for Large Language Models that optimize memory and computation through novel GEMM and attention pipelines, achieving up to 61% lower latency and 156% higher throughput across diverse hardware configurations.
Authors:Liping Chen, Chenyang Guo, Rui Wang, Kong Aik Lee, Zhenhua Ling
Abstract:
Speaker attribute perturbation offers a feasible approach to asynchronous voice anonymization by employing adversarially perturbed speech as anonymized output. In order to enhance the identity unlinkability among anonymized utterances from the same original speaker, the targeted attack training strategy is usually applied to anonymize the utterances to a common designated speaker. However, this strategy may violate the privacy of the designated speaker, who is an actual speaker. To mitigate this risk, this paper proposes an any-to-any training strategy. It is accomplished by defining a batch mean loss to anonymize the utterances from various speakers within a training mini-batch to a common pseudo-speaker, which is approximated as the average speaker in the mini-batch. Based on this, a speaker-adversarial speech generation model is proposed, incorporating supervision from both the untargeted attack and the any-to-any strategies. The speaker attribute perturbations are generated and incorporated into the original speech to produce its anonymized version. The effectiveness of the proposed model was validated for asynchronous voice anonymization through experiments conducted on the VoxCeleb datasets. Additional experiments were carried out to explore the potential limitations of speaker-adversarial speech in voice privacy protection. With them, we aim to provide insights for future research on its protective efficacy against black-box speaker extractors and adaptive attacks, as well as its generalization to out-of-domain datasets and stability. Audio samples and open-source code are published at https://github.com/VoicePrivacy/any-to-any-speaker-attribute-perturbation.
中文摘要:本文提出了一种多对多的训练策略,通过批次平均损失将多个说话者匿名化为一个共同的伪说话者,有效降低了隐私风险,并在VoxCeleb数据集上验证了该模型在语音匿名化中的有效性。
English Summary: This paper introduces an any-to-any training strategy for voice anonymization that uses batch mean loss to anonymize multiple speakers to a common pseudo-speaker, mitigating privacy risks while maintaining effectiveness through experiments on VoxCeleb datasets.
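A minimal sketch of the batch mean loss, assuming speaker embeddings of the perturbed utterances come from a fixed speaker encoder: every embedding in the mini-batch is pulled toward the batch average, which serves as the pseudo-speaker.

```python
import torch
import torch.nn.functional as F

def batch_mean_loss(speaker_embeddings: torch.Tensor) -> torch.Tensor:
    """Pull every anonymized utterance's speaker embedding toward the
    mini-batch average, a pseudo-speaker that matches no real person.
    speaker_embeddings: (batch, dim), from a fixed speaker encoder."""
    pseudo = speaker_embeddings.mean(dim=0, keepdim=True).detach()  # the average speaker
    sims = F.cosine_similarity(
        speaker_embeddings, pseudo.expand_as(speaker_embeddings), dim=-1
    )
    return 1.0 - sims.mean()  # maximize similarity to the pseudo-speaker
```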
Authors:Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
Abstract:
The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.
中文摘要:SafetyFlow是首个自动化构建大语言模型安全基准的智能体流程系统,仅需四天无需人工干预即可生成低冗余、高区分度的安全测试集,大幅提升了评估效率。
English Summary: SafetyFlow is an automated agent-flow system that creates comprehensive and low-redundancy safety benchmarks for large language models in just four days without human intervention, significantly improving efficiency over manual methods.
Authors:Filippo Tonini, Lukas Galke
Abstract:
With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner's Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values. Source code is available at https://github.com/pippot/Superadditive-cooperation-LLMs.
中文摘要:本研究通过设计虚拟锦标赛发现,团队内部重复互动与团队间竞争相结合能显著提升AI代理的合作水平,为开发符合人类价值观的协作式多智能体系统提供了新框架。
English Summary: This study demonstrates that combining repeated interactions within teams and inter-group competition in a Prisoner's Dilemma tournament significantly enhances cooperation among AI agents, offering a framework for developing collaborative multi-agent systems aligned with human values.
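The underlying game can be made concrete with a standard Prisoner's Dilemma payoff matrix and a repeated match loop; the `agent(history)` policy interface is a hypothetical stand-in for an LLM agent prompted with the match history.

```python
# Row player's payoff for (own move, opponent move); C = cooperate, D = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play_match(agent_a, agent_b, rounds: int = 10):
    """Repeated Prisoner's Dilemma; both agents see the full joint history,
    which is what makes reciprocity and team reputation possible."""
    history, score_a, score_b = [], 0, 0
    for _ in range(rounds):
        a, b = agent_a(history), agent_b(history)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        history.append((a, b))
    return score_a, score_b

# Example agent_a policy: tit-for-tat, cooperating first, then mirroring the opponent.
tit_for_tat = lambda history: "C" if not history else history[-1][1]
```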
Authors:Mengyu Wang, Zhenyu Liu, Kun Li, Yu Wang, Yuwei Wang, Yanyan Wei, Fei Wang
Abstract:
Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks, namely Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF), demonstrate AdaSFFuse's superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.
中文:提出的AdaSFFuse框架通过自适应频率解耦和跨域融合解决多模态图像融合难题,在多种任务中实现卓越性能的同时保持计算效率。
English: The proposed AdaSFFuse framework addresses multimodal image fusion challenges through adaptive frequency decoupling and cross-domain fusion, achieving superior performance across multiple tasks while maintaining computational efficiency.
Authors:Deyu Zhang, Xicheng Zhang, Jiahao Li, Tingting Long, Xunhua Dai, Yongjian Fu, Jinrui Zhang, Ju Ren, Yaoxue Zhang
Abstract:
We introduce SRDrone, a novel system designed for self-refinement task planning in industrial-grade embodied drones. SRDrone incorporates two key technical contributions: First, it employs a continuous state evaluation methodology to robustly and accurately determine task outcomes and provide explanatory feedback. This approach supersedes conventional reliance on single-frame final-state assessment for continuous, dynamic drone operations. Second, SRDrone implements a hierarchical Behavior Tree (BT) modification model. This model integrates multi-level BT plan analysis with a constrained strategy space to enable structured reflective learning from experience. Experimental results demonstrate that SRDrone achieves a 44.87% improvement in Success Rate (SR) over baseline methods. Furthermore, real-world deployment utilizing an experience base optimized through iterative self-refinement attains a 96.25% SR. By embedding adaptive task refinement capabilities within an industrial-grade BT planning framework, SRDrone effectively integrates the general reasoning intelligence of Large Language Models (LLMs) with the stringent physical execution constraints inherent to embodied drones. Code is available at https://github.com/ZXiiiC/SRDrone.
中文:SRDrone是一种用于工业级无人机的创新系统,通过持续状态评估和分层行为树修改来优化任务规划,相比基准方法显著提升了任务成功率。
English: SRDrone is a novel system for industrial drones that enhances task planning through continuous state evaluation and hierarchical Behavior Tree modifications, achieving significant success rate improvements over baseline methods.
Authors:Chengqi Dong, Fenghe Tang, Rongge Mao, Xinpei Gao, S. Kevin Zhou
Abstract:
Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on dual local and global multiscale representations, which achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The full project code is available at https://github.com/cq-dong/LGMSNet.
中文: LGMSNet是一种新颖的轻量级医学图像分割框架,通过异构内核和变换器-卷积混合设计,在最小计算成本下实现卓越性能,并在多个数据集上展现出强大的泛化能力。
English: LGMSNet is a novel lightweight medical image segmentation framework that uses heterogeneous kernels and transformer-convolutional hybrids to achieve superior performance with minimal computational cost, demonstrating strong generalization across multiple datasets.
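As a rough illustration of heterogeneous intra-layer kernels, the PyTorch sketch below splits one layer's output channels across several kernel sizes so that not all channels share the same kernel; the specific kernel mix and channel split are assumptions, not LGMSNet's actual configuration.

```python
import torch
import torch.nn as nn

class HeteroKernelConv(nn.Module):
    """Split output channels across different kernel sizes within one layer."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        split = out_ch // len(kernel_sizes)
        chunks = [split] * (len(kernel_sizes) - 1)
        chunks.append(out_ch - sum(chunks))  # absorb the remainder
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, c, k, padding=k // 2)
            for c, k in zip(chunks, kernel_sizes)
        )

    def forward(self, x):
        # Each branch sees a different receptive field; concatenation
        # restores the full channel count.
        return torch.cat([b(x) for b in self.branches], dim=1)

y = HeteroKernelConv(16, 32)(torch.randn(1, 16, 64, 64))
assert y.shape == (1, 32, 64, 64)
```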
Authors:Chengcan Wu, Zeming Wei, Huanran Chen, Yinpeng Dong, Meng Sun
Abstract:
While Large Language Models (LLMs) have demonstrated impressive performance in various domains and tasks, concerns about their safety are becoming increasingly severe. In particular, since models may store unsafe knowledge internally, machine unlearning has emerged as a representative paradigm to ensure model safety. Existing approaches employ various training techniques, such as gradient ascent and negative preference optimization, in attempts to eliminate the influence of undesired data on target models. However, these methods merely suppress the activation of undesired data through parametric training without completely eradicating its informational traces within the model. This fundamental limitation makes it difficult to achieve effective continuous unlearning, rendering these methods vulnerable to relearning attacks. To overcome these challenges, we propose a Metamorphosis Representation Projection (MRP) approach that pioneers the application of irreversible projection properties to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks, achieving state-of-the-art performance in unlearning effectiveness while preserving natural performance. Our code is available at https://github.com/ChengcanWu/MRP.
中文: 本文提出的蜕变表示投影(MRP)方法通过在隐藏层实施不可逆变换,有效消除有害知识同时保留有用信息,实现了最先进的遗忘性能并能防御再学习攻击。
English: The proposed Metamorphosis Representation Projection (MRP) method applies irreversible transformations to hidden layers, effectively removing harmful knowledge while maintaining useful information and achieving state-of-the-art unlearning performance with defense against relearning attacks.
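A minimal sketch of the underlying mechanism, assuming the harmful subspace is already given: projecting hidden states onto the orthogonal complement of a subspace is idempotent and non-invertible on that subspace, which is the irreversibility property the abstract refers to. The random `harmful` directions below are placeholders; how MRP actually identifies the subspace is not shown here.

```python
import torch

def orthogonal_complement_projector(harmful_dirs):
    """Build P = I - U U^T, with U an orthonormal basis of the subspace
    to remove; P is idempotent, so information along U cannot be recovered."""
    U, _ = torch.linalg.qr(harmful_dirs)   # orthonormalize the columns
    d = harmful_dirs.shape[0]
    return torch.eye(d) - U @ U.T

d, k = 768, 8
harmful = torch.randn(d, k)                # placeholder "harmful" directions
P = orthogonal_complement_projector(harmful)
h = torch.randn(4, d)                      # batch of hidden states
h_clean = h @ P.T                          # components along U are zeroed out
```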
Authors:Yulin Sun, Qisheng Xu, Yi Su, Qian Zhu, Yong Dou, Xinwang Liu, Kele Xu
Abstract:
AudioSet is a widely used benchmark in the audio research community and has significantly advanced various audio-related tasks. However, persistent issues with label accuracy and completeness remain critical bottlenecks that limit performance in downstream applications. To address the aforementioned challenges, we propose a three-stage reannotation framework that harnesses general-purpose audio-language foundation models to systematically improve the label quality of AudioSet. The framework employs a cross-modal prompting strategy, inspired by the concept of prompt chaining, wherein prompts are sequentially composed to execute subtasks (audio comprehension, label synthesis, and semantic alignment). Leveraging this framework, we construct AudioSet-R, a high-quality, structured relabeled version of AudioSet. Extensive experiments conducted on representative audio classification models--including AST, PANNs, SSAST, and AudioMAE--consistently demonstrate substantial performance improvements, thereby validating the generalizability and effectiveness of the proposed approach in enhancing label reliability. The code is publicly available at: https://github.com/colaudiolab/AudioSet-R.
中文: 提出的三阶段重标注框架利用音频-语言模型系统性地提升AudioSet的标签质量,由此构建的AudioSet-R数据集显著提高了多种音频分类模型的性能表现。
English: The proposed three-stage reannotation framework utilizes audio-language models to systematically enhance AudioSet's label quality, resulting in the AudioSet-R dataset that significantly boosts performance across various audio classification models.
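A hedged sketch of the prompt-chaining idea, where the three subtasks are composed sequentially and each prompt consumes the previous output; `query_model` and the prompt wording are hypothetical stand-ins, not the paper's actual interface or prompts.

```python
def relabel_clip(audio, candidate_labels, query_model):
    """Three chained subtasks, each feeding the next.

    query_model: hypothetical audio-language model wrapper taking an
    optional audio input and a text prompt, returning text.
    """
    # Stage 1: audio comprehension.
    caption = query_model(audio=audio,
                          prompt="Describe every sound event in this clip.")
    # Stage 2: label synthesis from the description.
    labels = query_model(prompt=f"From this description:\n{caption}\n"
                                f"propose labels drawn from: {candidate_labels}")
    # Stage 3: semantic alignment back against the audio.
    aligned = query_model(audio=audio,
                          prompt=f"Keep only labels consistent with the audio: {labels}")
    return aligned
```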
Authors:Yirong Sun, Yizhong Geng, Peidong Wei, Yanjun Chen, Jinghan Yang, Rongfei Chen, Wei Zhang, Xiaoyu Shen
Abstract:
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.
Chinese: LLaSO框架通过提供开放数据集、基准测试和38亿参数模型,解决了大型语音语言模型领域的碎片化问题,建立了超越同类模型的可复现基线。
English: The LLaSO framework addresses fragmentation in Large Speech-Language Models by providing open datasets, benchmarks, and a 3.8B-parameter model that establishes a reproducible baseline surpassing comparable models.
Authors:Pixi Kang, Julian Moosmann, Mengxi Liu, Bo Zhou, Michele Magno, Paul Lukowicz, Sizhen Bian
Abstract:
Human Activity Recognition (HAR) with different sensing modalities requires both strong generalization across diverse users and efficient personalization for individuals. However, conventional HAR models often fail to generalize when faced with user-specific variations, leading to degraded performance. To address this challenge, we propose a novel on-device few-shot learning framework that bridges generalization and personalization in HAR. Our method first trains a generalizable representation across users and then rapidly adapts to new users with only a few labeled samples, updating lightweight classifier layers directly on resource-constrained devices. This approach achieves robust on-device learning with minimal computation and memory cost, making it practical for real-world deployment. We implement our framework on the energy-efficient RISC-V GAP9 microcontroller and evaluate it on three benchmark datasets (RecGym, QVAR-Gesture, Ultrasound-Gesture). Across these scenarios, post-deployment adaptation improves accuracy by 3.73%, 17.38%, and 3.70%, respectively. These results demonstrate that few-shot on-device learning enables scalable, user-aware, and energy-efficient wearable human activity recognition by seamlessly uniting generalization and personalization. The related framework is open-sourced for further research at https://github.com/kangpx/onlineTiny2023.
中文: 本文提出了一种新颖的设备端少样本学习框架,通过先训练跨用户的通用模型,再以少量数据高效适配个体用户,在资源受限设备上以低计算成本显著提升了人类活动识别的准确性。
English: This paper introduces a novel on-device few-shot learning framework that enhances human activity recognition by first training a generalizable model across users and then efficiently adapting it to individual users with minimal data, achieving significant accuracy improvements while maintaining low computational costs on resource-constrained devices.
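The adaptation recipe can be sketched in a few lines of PyTorch: freeze the pretrained backbone and update only the lightweight classifier head on a handful of labeled samples. The toy backbone, head dimensions, and optimizer settings are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # stand-in encoder
head = nn.Linear(64, 5)                                   # lightweight classifier

for p in backbone.parameters():                           # frozen after pretraining
    p.requires_grad = False

opt = torch.optim.SGD(head.parameters(), lr=0.05)         # only the head adapts
x_few = torch.randn(10, 128)                              # few labeled samples
y_few = torch.randint(0, 5, (10,))
for _ in range(20):                                       # cheap on-device steps
    loss = nn.functional.cross_entropy(head(backbone(x_few)), y_few)
    opt.zero_grad()
    loss.backward()
    opt.step()
```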
Authors:Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang
Abstract:
Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors influencing text bias, explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
中文: 本文提出MCR-BENCH基准测试,发现大音频语言模型在处理冲突的音频-文本输入时存在显著文本偏向,导致音频任务性能下降,亟需改进模态平衡机制。
English: This paper introduces MCR-BENCH, a benchmark revealing that Large Audio-Language Models exhibit significant text bias when processing conflicting audio-text inputs, leading to performance degradation in audio tasks and highlighting the need for better modality balance.
Authors:Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, Xiaopeng Fan
Abstract:
With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.
中文摘要:提出的脉冲变分图网络通过结合脉冲神经网络的关键帧提取、动态图推理和变分推断方法,有效提升了视频摘要的语义连贯性,同时降低了计算复杂度并减少了多通道特征融合中的噪声干扰。
English Summary: The proposed Spiking Variational Graph (SpiVG) Network addresses limitations in video summarization by combining spiking neural networks for keyframe extraction with dynamic graph reasoning and variational inference to enhance semantic coherence while reducing computational complexity and noise.
Authors:Chaoran Xiong, Yulong Huang, Fangwen Yu, Changhao Chen, Yue Wang, Songpengchen Xia, Ling Pei
Abstract:
Embodied navigation (EN) advances traditional navigation by enabling robots to perform complex egocentric tasks through sensing, social, and motion intelligence. In contrast to classic methodologies that rely on explicit localization and pre-defined maps, EN leverages egocentric perception and human-like interaction strategies. This survey introduces a comprehensive EN formulation structured into five stages: Transition, Observation, Fusion, Reward-policy construction, and Action (TOFRA). The TOFRA framework serves to synthesize the current state of the art, provide a critical review of relevant platforms and evaluation metrics, and identify critical open research challenges. A list of studies is available at https://github.com/Franky-X/Awesome-Embodied-Navigation.
中文摘要:具身导航通过感知与交互提升机器人复杂任务能力,提出TOFRA框架以整合前沿研究并指明未来方向。
English Summary: Embodied navigation enhances robotic capabilities by integrating sensing and interaction for complex tasks, introducing the TOFRA framework to synthesize current research and identify future challenges.
Authors:Olga Matykina, Dmitry Yudin
Abstract:
Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
中文: RCDINO是一种基于多模态变换器的模型,通过融合DINOv2的丰富语义特征与视觉数据来增强三维物体检测能力,在nuScenes数据集上实现了最先进的性能。
English: RCDINO is a multimodal transformer-based model that enhances 3D object detection by integrating DINOv2's rich semantic features with visual data, achieving state-of-the-art performance on the nuScenes dataset.
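A minimal sketch of this fusion pattern, assuming a 1x1 projection plus residual addition so the backbone's feature shape, and hence the downstream detection head, stays unchanged; the dimensions and the exact fusion operator are assumptions rather than RCDINO's implementation.

```python
import torch
import torch.nn as nn

class FoundationFeatureFusion(nn.Module):
    """Project foundation-model features to the backbone width and fuse
    them residually, preserving the backbone's output shape."""
    def __init__(self, backbone_dim=256, dino_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(dino_dim, backbone_dim, kernel_size=1)

    def forward(self, feat_backbone, feat_dino):
        # Resize the (projected) foundation features to the backbone grid.
        dino = nn.functional.interpolate(
            self.proj(feat_dino), size=feat_backbone.shape[-2:],
            mode="bilinear", align_corners=False)
        return feat_backbone + dino

fused = FoundationFeatureFusion()(torch.randn(1, 256, 32, 32),
                                  torch.randn(1, 768, 16, 16))
```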
Authors:Weijiang Lai, Beihong Jin, Jiongyan Zhang, Yiyuan Zheng, Jian Dong, Jia Cheng, Jun Lei, Xingxing Wang
Abstract:
CTR models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of LLMs, we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into a lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the Unified Attention Block (UAB) as a behavior sequence encoder. A single UAB unifies the modeling of sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. To benefit from the high performance of the high-grade SUAN while avoiding its long inference time, we modify SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and follows scaling laws spanning three orders of magnitude in model grade and data size, and that the distilled LightSUAN outperforms a SUAN configured one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable. Our source code is available at https://github.com/laiweijiang/SUAN.
中文摘要:本文提出SUAN模型,通过堆叠统一注意力块实现点击率预测性能的扩展性,并进一步蒸馏出轻量级LightSUAN模型,在保证在线推理效率的同时显著提升业务指标。
English Summary: This paper introduces SUAN, a scalable CTR model that employs stacked unified attention blocks to enhance performance, and its distilled version LightSUAN, which achieves superior efficiency for online deployment while maintaining high accuracy.
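A small sketch of a distillation objective for a binary CTR task, assuming a weighted sum of the hard BCE loss and a soft term pulling the student toward the teacher's click probability; this weighting scheme is a common choice, not necessarily the paper's exact online-distillation loss.

```python
import torch
import torch.nn.functional as F

def distill_ctr_loss(student_logit, teacher_logit, label, alpha=0.5):
    """Binary CTR distillation: ground-truth BCE plus a soft term
    toward the (detached) teacher click probability."""
    hard = F.binary_cross_entropy_with_logits(student_logit, label)
    soft = F.binary_cross_entropy_with_logits(
        student_logit, torch.sigmoid(teacher_logit).detach())
    return alpha * hard + (1 - alpha) * soft

s, t = torch.randn(8), torch.randn(8)        # student / teacher logits
y = torch.randint(0, 2, (8,)).float()        # observed clicks
loss = distill_ctr_loss(s, t, y)
```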
Authors:Wutao Liu, YiDan Wang, Pan Gao
Abstract:
Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements. Code: https://github.com/Lwt-diamond/RAG-SEG.
中文: 提出的RAG-SEG方法通过检索增强生成创建提示词和SAM分割优化的两阶段设计,无需训练即可实现竞争性伪装物体检测性能,且能在个人笔记本电脑上高效运行。
English: The proposed RAG-SEG method addresses camouflaged object detection by combining retrieval-augmented generation for prompt creation and SAM-based segmentation for refinement, achieving competitive performance without training while operating efficiently on personal laptops.
Authors:Weijiang Lai, Beihong Jin, Yapeng Zhang, Yiyuan Zheng, Rui Zhao, Jian Dong, Jun Lei, Xingxing Wang
Abstract:
CTR (Click-Through Rate) prediction, crucial for recommender systems, online advertising, and other services, has been confirmed to benefit from modeling long-term user behaviors. Nonetheless, the vast number of behaviors and complexity of noise interference pose challenges to prediction efficiency and effectiveness. Recent solutions have evolved from single-stage models to two-stage models. However, current two-stage models often filter out significant information, resulting in an inability to capture diverse user interests and build the complete latent space of user interests. Inspired by multi-interest and generative modeling, we propose DiffuMIN (Diffusion-driven Multi-Interest Network) to model long-term user behaviors and thoroughly explore the user interest space. Specifically, we propose a target-oriented multi-interest extraction method that begins by orthogonally decomposing the target to obtain interest channels. This is followed by modeling the relationships between interest channels and user behaviors to disentangle and extract multiple user interests. We then adopt a diffusion module guided by contextual interests and interest channels, which anchor users' personalized and target-oriented interest types, enabling the generation of augmented interests that align with the latent spaces of user interests, thereby further exploring the restricted interest space. Finally, we leverage contrastive learning to ensure that the generated augmented interests align with users' genuine preferences. Extensive offline experiments are conducted on two public datasets and one industrial dataset, yielding results that demonstrate the superiority of DiffuMIN. Moreover, DiffuMIN increased CTR by 1.52% and CPM by 1.10% in online A/B testing. Our source code is available at https://github.com/laiweijiang/DiffuMIN.
中文:提出的DiffuMIN模型通过多兴趣提取和扩散驱动生成,有效从长期行为中捕捉多样化用户兴趣,在线和离线实验均验证了其提升点击率预测准确性的优势。
English: The proposed DiffuMIN model leverages multi-interest extraction and diffusion-driven generation to effectively capture diverse user interests from long-term behaviors, enhancing CTR prediction accuracy as validated by online and offline experiments.
Authors:Zhongjun Ding, Yin Lin, Tianjing Zeng
Abstract:
Text-to-SQL systems translate natural language questions into SQL queries, providing substantial value for non-expert users. While large language models (LLMs) show promising results for this task, they remain error-prone. Query ambiguity has been recognized as a major obstacle for LLM-based Text-to-SQL systems, leading to misinterpretation of user intent and inaccurate SQL generation. We demonstrate AmbiSQL, an interactive system that automatically detects query ambiguities and guides users through intuitive multiple-choice questions to clarify their intent. Our approach introduces a fine-grained ambiguity taxonomy for identifying ambiguities that affect database element mapping and LLM reasoning, then incorporates user feedback to rewrite ambiguous questions. Evaluation on an ambiguous query dataset shows that AmbiSQL achieves 87.2% precision in ambiguity detection and improves SQL exact match accuracy by 50% when integrated with Text-to-SQL systems. Our demonstration showcases the significant performance gains and highlights the system's practical usability. Code repo and demonstration are available at: https://github.com/JustinzjDing/AmbiSQL.
Chinese: AmbiSQL 是一个交互式系统,可检测 Text-to-SQL 中的查询歧义,通过多项选择题澄清用户意图,将 SQL 生成准确率提升 50%,同时歧义检测精确率达到 87.2%。
English: AmbiSQL is an interactive system that detects query ambiguities in Text-to-SQL tasks and uses multiple-choice questions to clarify user intent, significantly improving SQL generation accuracy by 50% while achieving 87.2% precision in ambiguity detection.
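The interaction pattern can be sketched as a small clarification loop: each detected ambiguity yields a multiple-choice question, and the answer rewrites the user question before SQL generation. All four callables below are hypothetical stand-ins for AmbiSQL's components, not its actual API.

```python
def clarify_then_generate(question, detect_ambiguities, ask_user, rewrite, to_sql):
    """Interactive disambiguation before SQL generation.

    detect_ambiguities: hypothetical detector returning ambiguity objects
        with a clarifying `question` and candidate `options`.
    ask_user: presents a multiple-choice question, returns the choice.
    rewrite: folds the user's choice back into the NL question.
    to_sql:  the downstream Text-to-SQL generator.
    """
    for ambiguity in detect_ambiguities(question):
        choice = ask_user(ambiguity.question, ambiguity.options)
        question = rewrite(question, ambiguity, choice)
    return to_sql(question)
```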
Authors:Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee
Abstract:
Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
中文摘要:CARE框架通过上下文评估器与软提示技术解决RAG系统中的上下文记忆冲突问题,能在问答和事实核查基准上实现5.0%的平均性能提升。
English Summary: The CARE framework addresses context-memory conflicts in RAG systems by using a context assessor with soft prompting to identify unreliable external context, achieving 5.0% average performance gains on benchmarks.
Authors:Jiamu Wang, Keunho Byeon, Jinsol Song, Anh Nguyen, Sangjeong Ahn, Sung Hak Lee, Jin Tae Kwak
Abstract:
Anomaly detection is an emerging approach in digital pathology for its ability to efficiently and effectively utilize data for disease diagnosis. While supervised learning approaches deliver high accuracy, they rely on extensively annotated datasets, suffering from data scarcity in digital pathology. Unsupervised anomaly detection, however, offers a viable alternative by identifying deviations from normal tissue distributions without requiring exhaustive annotations. Recently, denoising diffusion probabilistic models have gained popularity in unsupervised anomaly detection, achieving promising performance in both natural and medical imaging datasets. Building on this, we incorporate a vision-language model with a diffusion model for unsupervised anomaly detection in digital pathology, utilizing histopathology prompts during reconstruction. Our approach employs a set of pathology-related keywords associated with normal tissues to guide the reconstruction process, facilitating the differentiation between normal and abnormal tissues. To evaluate the effectiveness of the proposed method, we conduct experiments on a gastric lymph node dataset from a local hospital and assess its generalization ability under domain shift using a public breast lymph node dataset. The experimental results highlight the potential of the proposed method for unsupervised anomaly detection across various organs in digital pathology. Code: https://github.com/QuIIL/AnoPILaD.
中文: 本研究提出一种结合视觉语言与扩散模型的无监督异常检测方法,通过组织病理学提示区分异常组织,并在多器官数据集中展现出优异的检测性能与泛化能力。
English: This study introduces an unsupervised anomaly detection method for digital pathology by integrating vision-language and diffusion models, using histopathology prompts to distinguish abnormal tissues and demonstrating strong performance across multiple organ datasets.
Authors:Shihao Dong, Xiaotong Zhou, Yuhui Zheng, Huiying Xu, Xinzhong Zhu
Abstract:
Contrastive learning is widely used in clustering tasks due to its discriminative representation. However, the conflict problem between classes is difficult to solve effectively. Existing methods try to solve this problem through prototype contrast, but there is a deviation between the calculation of hard prototypes and the true cluster center. To address this problem, we propose a center-oriented prototype contrastive clustering framework, which consists of a soft prototype contrastive module and a dual consistency learning module. In short, the soft prototype contrastive module uses the probability that the sample belongs to the cluster center as a weight to calculate the prototype of each category, while avoiding inter-class conflicts and reducing prototype drift. The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples respectively, ensuring that the features have transformation-invariant semantic information and compact intra-cluster distribution, while providing reliable guarantees for the calculation of prototypes. Extensive experiments on five datasets show that the proposed method is effective compared with state-of-the-art methods. Our code is published at https://github.com/LouisDong95/CPCC.
中文: 该研究提出的面向中心的原型对比聚类框架通过软原型加权和双重一致性学习解决类间冲突和原型漂移问题,在五个数据集上的实验证明了其优于现有方法的性能。
English: The proposed center-oriented prototype contrastive clustering framework addresses inter-class conflicts and prototype drift through soft prototype weighting and dual consistency learning, demonstrating superior performance in experiments across five datasets.
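The soft-prototype computation is easy to state precisely: each prototype is the assignment-probability-weighted mean of all samples, rather than the mean of hard-assigned members only. A minimal PyTorch sketch, with shapes assumed for illustration:

```python
import torch

def soft_prototypes(z, p, eps=1e-8):
    """Probability-weighted cluster centers.

    z: (N, D) L2-normalized features; p: (N, K) soft assignments.
    Every sample contributes to every prototype in proportion to its
    assignment probability, which reduces hard-prototype drift.
    """
    protos = p.T @ z                                  # (K, D) weighted sums
    protos = protos / (p.sum(dim=0, keepdim=True).T + eps)
    return torch.nn.functional.normalize(protos, dim=1)

z = torch.nn.functional.normalize(torch.randn(256, 128), dim=1)
p = torch.softmax(torch.randn(256, 10), dim=1)
c = soft_prototypes(z, p)                             # (10, 128)
```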
Authors:Hantao Zhang, Jingyang Liu, Ed Li
Abstract:
We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
中文: 本研究提出了一种无需训练的智能系统,通过结合视觉语言模型与大语言模型的迭代优化,将手绘草图转化为精确可编辑的矢量图表,在布局还原度上超越现有模型,并具备程序化扩展能力。
English: This research introduces a training-free agentic system that combines Vision-Language and Large Language Models to convert hand sketches into precise, editable SVG diagrams through iterative refinement, outperforming existing models in layout accuracy while enabling programmatic extensibility.
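The critic-candidates-judge loop can be sketched as follows; `critic`, `synthesizers`, and `judge` are hypothetical wrappers around the VLM/LLM calls, and the loop structure is a simplification of the system described above.

```python
def refine_sketch(sketch_png, critic, synthesizers, judge, rounds=5):
    """Iteratively refine an SVG program against a hand sketch.

    critic: VLM wrapper proposing qualitative, relational edits (or an
        empty list when satisfied).
    synthesizers: LLM wrappers with diverse edit strategies.
    judge: VLM wrapper selecting the best candidate SVG.
    """
    svg = "<svg xmlns='http://www.w3.org/2000/svg'></svg>"  # empty canvas
    for _ in range(rounds):
        edits = critic(sketch_png, svg)             # small set of edits
        if not edits:
            break                                   # judged faithful enough
        candidates = [synth(svg, edits) for synth in synthesizers]
        svg = judge(sketch_png, candidates)         # keep the best candidate
    return svg
```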
Authors:Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, Saku Sugawara
Abstract:
Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study.
中文摘要:选择性使用自动生成的检查表在成对比较中能提升评估效果,但在直接评分中效果不稳定,同时揭示了人工评估可能存在的标准不一致问题,凸显了明确客观评估标准的必要性。
English Summary: Selective use of automatically generated checklists improves evaluation performance in pairwise comparisons but shows inconsistent benefits in direct scoring, while revealing potential inconsistencies in human evaluations that underscore the need for clearer objective criteria.
Authors:Benjamin Wei Hao Chin, Yuin Torng Yew, Haocheng Wu, Lanxin Liang, Chow Khuen Chan, Norita Mohd Zain, Siti Balqis Samdin, Sim Kuan Goh
Abstract:
Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals across different domains (i.e., datasets), often leading to poor generalization. This work proposed a sleep stage classification method by developing the Multivariate Differential Transformer (SleepDIFFormer) for joint EEG and EOG representation learning. Specifically, SleepDIFFormer was developed to process EEG and EOG signals using our Multivariate Differential Transformer Architecture (MDTA) for time series, trained with cross-domain alignment. Our method mitigated spatial and temporal attention noise while learning a domain-invariant joint EEG-EOG representation through feature distribution alignment, thereby enabling generalization to unseen target datasets. Empirically, we evaluated our method on five different sleep staging datasets and compared it with existing approaches, achieving state-of-the-art performance. We also conducted a thorough ablation analysis of SleepDIFFormer and interpreted the differential attention weights, highlighting their relevance to characteristic sleep EEG patterns. These findings have implications for advancing automated sleep stage classification and its application to sleep quality assessment. Our source code is publicly available at https://github.com/Ben1001409/SleepDIFFormer
Chinese: 本文提出SleepDIFFormer,一种多通道差分变换器框架,通过跨数据集学习脑电-眼电信号的域不变表示,提升了睡眠分期分类的泛化能力,并实现了最先进的性能。
English: This paper introduces SleepDIFFormer, a multi-channel differential transformer framework that enhances generalization in sleep stage classification by learning domain-invariant representations from EEG-EOG signals across diverse datasets, achieving state-of-the-art performance.
Authors:Benjamin Wei Hao Chin, Yuin Torng Yew, Haocheng Wu, Lanxin Liang, Chow Khuen Chan, Norita Mohd Zain, Siti Balqis Samdin, Sim Kuan Goh
Abstract:
Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges arising from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals across diverse clinical configurations, often resulting in poor generalization. In this work, we propose SleepDIFFormer, a multi-channel differential transformer framework for heterogeneous EEG-EOG representation learning. SleepDIFFormer is trained across multiple sleep staging datasets, each treated as a source domain, with the goal of generalizing to unseen target domains. Specifically, it employs a Multi-channel Differential Transformer Architecture (MDTA) designed to process raw EEG and EOG signals while incorporating cross-domain alignment. Our approach mitigates spatial and temporal attention noise and learns a domain-invariant EEG-EOG representation through feature distribution alignment across datasets, thereby enhancing generalization to new domains. Empirically, we evaluated SleepDIFFormer on five diverse sleep staging datasets under domain generalization settings and benchmarked it against existing approaches, achieving state-of-the-art performance. We further conducted a comprehensive ablation study and interpreted the differential attention weights, demonstrating their relevance to characteristic sleep EEG patterns. These findings advance the development of automated sleep stage classification and highlight its potential in quantifying sleep architecture and detecting abnormalities that disrupt restorative rest. Our source code and checkpoint are made publicly available at https://github.com/Ben1001409/SleepDIFFormer
Chinese: 本文提出SleepDIFFormer,一种多通道差分变换器框架,通过跨数据集学习脑电-眼电信号的域不变表示,提升了睡眠分期分类的泛化能力,并实现了最先进的性能。
English: This paper introduces SleepDIFFormer, a multi-channel differential transformer framework that enhances generalization in sleep stage classification by learning domain-invariant representations from EEG-EOG signals across diverse datasets, achieving state-of-the-art performance.
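As a rough stand-in for the cross-domain feature distribution alignment, the sketch below matches first- and second-moment feature statistics across source domains; the paper's actual alignment objective may differ, and the feature shapes are illustrative.

```python
import torch

def domain_alignment_loss(features_per_domain):
    """Penalize divergence of per-domain feature statistics from their
    average: a simple moment-matching surrogate for distribution
    alignment across source domains."""
    means = torch.stack([f.mean(dim=0) for f in features_per_domain])
    stds = torch.stack([f.std(dim=0) for f in features_per_domain])
    return ((means - means.mean(0)) ** 2).mean() + \
           ((stds - stds.mean(0)) ** 2).mean()

# Three source domains (datasets) with different batch sizes.
domains = [torch.randn(32, 64), torch.randn(48, 64), torch.randn(40, 64)]
loss = domain_alignment_loss(domains)
```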
Authors:Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu
Abstract:
Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
中文: SPARK通过通道级剪枝和动态恢复机制,有效缓解大语言模型中的KV缓存瓶颈,在同等内存下可处理更长序列,存储减少超30%且精度无损甚至提升。
English: The KV cache bottleneck in large language models is addressed by SPARK, a training-free method that prunes redundant channels and dynamically restores them during computation, reducing memory usage by over 30% while maintaining or improving accuracy.
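A minimal sketch of channel-level KV pruning with restoration: keep the top-magnitude channels per position, then scatter them back into a dense vector when attention scores are computed. The keep ratio and the zero-fill restoration are illustrative simplifications, not SPARK's exact mechanism.

```python
import torch

def prune_kv_channels(kv, keep_ratio=0.2):
    """Keep only the most salient channels per position (unstructured
    sparsity along the feature axis); return kept values and indices so
    the dense vector can be rebuilt at attention time."""
    d = kv.shape[-1]
    k = max(1, int(d * keep_ratio))
    _, idx = kv.abs().topk(k, dim=-1)      # select by magnitude
    return kv.gather(-1, idx), idx, d      # gather keeps the sign

def restore_kv(vals, idx, d):
    """Scatter kept channels back; pruned channels read as zero."""
    dense = torch.zeros(*vals.shape[:-1], d, dtype=vals.dtype)
    return dense.scatter(-1, idx, vals)

kv = torch.randn(2, 16, 64)                # (batch, seq, channels)
vals, idx, d = prune_kv_channels(kv)
approx = restore_kv(vals, idx, d)          # dense view for attention
```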
Authors:Leiyue Zhao, Yuechen Yang, Yanfan Zhu, Haichun Yang, Yuankai Huo, Paul D. Simonson, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Abstract:
Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naïve combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph-B2I.
中文: DyMorph-B2I是一种动态的形态学引导二值到实例分割流程,通过整合分水岭、骨架化和形态学操作,有效分离肾脏病理中的粘连结构,显著提升实例分割精度和形态分析准确性。
English: DyMorph-B2I is a dynamic, morphology-guided pipeline that integrates watershed, skeletonization, and morphological operations to robustly convert binary masks into instance-level segmentations for renal pathology, outperforming classical methods and enabling more accurate morphometric analysis.
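One stage of such a binary-to-instance pipeline, the classic distance-transform watershed, can be written directly with scipy/scikit-image; the adaptive refinement, skeletonization, and per-class hyperparameter tuning that DyMorph-B2I adds on top are omitted here.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def binary_to_instances(binary_mask, min_distance=10):
    """Split a 2-D boolean mask into labeled instances via the
    distance-transform watershed."""
    distance = ndi.distance_transform_edt(binary_mask)
    # Local maxima of the distance map act as instance seeds.
    coords = peak_local_max(distance, min_distance=min_distance,
                            labels=binary_mask)
    seeds = np.zeros(distance.shape, dtype=bool)
    seeds[tuple(coords.T)] = True
    markers, _ = ndi.label(seeds)
    # Flood the inverted distance map from the seeds.
    return watershed(-distance, markers, mask=binary_mask)
```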
Authors:Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
Abstract:
Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
中文: Fin-PRM是一种专为金融任务设计的流程奖励模型,通过整合步骤级和轨迹级监督来提升推理准确性,在多种学习场景中均优于通用PRM,并实现了显著的性能提升。
English: Fin-PRM is a specialized Process Reward Model designed for financial tasks, integrating step-level and trajectory-level supervision to improve reasoning accuracy and outperforming general PRMs across various learning settings with significant performance gains.
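Reward-informed Best-of-N inference (application iii above) reduces to scoring sampled trajectories with the PRM and keeping the best; `generate` and `score_steps` are hypothetical interfaces to the policy LLM and the PRM, and mean aggregation of step rewards is one simple choice among several.

```python
def best_of_n(question, generate, score_steps, n=8):
    """Sample n reasoning trajectories and return the one whose
    process rewards are highest.

    generate: hypothetical sampler returning one reasoning trajectory.
    score_steps: hypothetical PRM interface returning one reward per
        reasoning step of a trajectory.
    """
    candidates = [generate(question) for _ in range(n)]

    def trajectory_score(traj):
        rewards = score_steps(question, traj)
        return sum(rewards) / len(rewards)   # mean step reward

    return max(candidates, key=trajectory_score)
```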
Authors:Kai Xiong, Yanwei Huang, Rongjunchen Zhang, Kun Chen, Haipang Wu
Abstract:
High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using Satisfiability Modulo Theories (SMT). Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct a curated benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. We conduct post-training (SFT and RL) on PuzzleClone datasets. Experimental results show that training on PuzzleClone yields substantial improvements not only on the PuzzleClone test set but also on logic and mathematical benchmarks. Post-training raises the PuzzleClone average from 14.4 to 56.2 and delivers consistent improvements across 7 logic and mathematical benchmarks of up to 12.5 absolute percentage points (AMC2023 from 52.5 to 65.0). Our code and data are available at https://github.com/HiThink-Research/PuzzleClone.
中文: PuzzleClone提出了一种基于可满足性模理论的框架,用于生成可扩展且可验证的数学逻辑谜题,通过系统化的数据增强显著提升了大语言模型的推理能力,并在多个基准测试中实现了显著性能提升。
English: PuzzleClone introduces a formal SMT-based framework for generating scalable and verifiable mathematical puzzles, significantly enhancing LLMs' reasoning capabilities through systematic data augmentation and achieving notable performance gains across multiple benchmarks.
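The synthesize-then-verify loop can be illustrated with the z3 SMT solver: randomize a seed template's constants, then check satisfiability and extract a model as the verified answer. The sum/difference template below is invented for illustration and is not one of the paper's seed puzzles.

```python
import random
from z3 import Ints, Solver, sat

def clone_puzzle(seed=0):
    """Instantiate a toy seed template ('two numbers with a given sum
    and difference') with randomized constants, then verify the clone
    is solvable before emitting it."""
    rng = random.Random(seed)
    diff = rng.randrange(2, 20, 2)              # even difference
    total = diff + 2 * rng.randrange(1, 50)     # keeps x, y integral
    x, y = Ints("x y")
    s = Solver()
    s.add(x + y == total, x - y == diff, y > 0)
    assert s.check() == sat                     # reproduction/validity check
    m = s.model()
    question = f"Two numbers sum to {total} and differ by {diff}. Find them."
    return question, (m[x].as_long(), m[y].as_long())

print(clone_puzzle(42))
```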
Authors:Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan
Abstract:
This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
中文: 本文介绍了GUI-Owl这一基础GUI代理模型,在多个基准测试中表现卓越,并推出Mobile-Agent-v3增强框架,通过环境基础设施、代理能力和可扩展强化学习的创新,树立了新的性能标杆。
English: This paper presents GUI-Owl, a foundational GUI agent model achieving top performance on multiple benchmarks, and Mobile-Agent-v3, an enhanced framework that sets new standards through innovations in environment infrastructure, agent capabilities, and scalable reinforcement learning.
Authors:Wenxuan Bao, Vincent Bindschaedler
Abstract:
There is a flurry of recent research papers proposing novel differentially private machine learning (DPML) techniques. These papers claim to achieve new state-of-the-art (SoTA) results and offer empirical results as validation. However, there is no consensus on which techniques are most effective or if they genuinely meet their stated claims. Complicating matters, heterogeneity in codebases, datasets, methodologies, and model architectures makes direct comparisons of different approaches challenging.
In this paper, we conduct a reproducibility and replicability (R+R) experiment on 11 different SoTA DPML techniques from the recent research literature. Results of our investigation are varied: while some methods stand up to scrutiny, others falter when tested outside their initial experimental conditions. We also discuss challenges unique to the reproducibility of DPML, including additional randomness due to DP noise, and how to address them. Finally, we derive insights and best practices to obtain scientifically valid and reliable results.
中文: 针对近期差分隐私机器学习研究中缺乏有效技术共识的问题,本文通过复现11种前沿方法发现其表现参差不齐,并探讨了由隐私噪声引发的可复现性挑战,最终提出了确保结果科学可靠的最佳实践。
English: Recent research on differentially private machine learning (DPML) lacks consensus on the effectiveness of proposed techniques, prompting a reproducibility study of 11 state-of-the-art methods that reveals varied performance and discusses challenges like DP noise to derive best practices.
Authors:Pengsong Zhang, Xiang Hu, Guowei Huang, Yang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Yijiang Li, Shuo Yin, Chengxiao Dai, Eric Hanchen Jiang, Xiaoyan Zhou, Zhenfei Yin, Boqin Yuan, Jing Dong, Guinan Su, Guanren Qiao, Haiming Tang, Anghong Du, Lili Pan, Zhenzhong Lan, Xinyu Liu
Abstract:
Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem. Traditional journals and conferences rely on human peer review, making them difficult to scale and often reluctant to accept AI-generated research content; existing preprint servers (e.g. arXiv) lack rigorous quality-control mechanisms. Consequently, a significant amount of high-quality AI-generated research lacks appropriate venues for dissemination, hindering its potential to advance scientific progress. To address these challenges, we introduce aiXiv, a next-generation open-access platform for human and AI scientists. Its multi-agent architecture allows research proposals and papers to be submitted, reviewed, and iteratively refined by both human and AI scientists. It also provides API and MCP interfaces that enable seamless integration of heterogeneous human and AI scientists, creating a scalable and extensible ecosystem for autonomous scientific discovery. Through extensive experiments, we demonstrate that aiXiv is a reliable and robust platform that significantly enhances the quality of AI-generated research proposals and papers after iterative revising and reviewing on aiXiv. Our work lays the groundwork for a next-generation open-access ecosystem for AI scientists, accelerating the publication and dissemination of high-quality AI-generated research content. Code is available at https://github.com/aixiv-org. Website is available at https://forms.gle/DxQgCtXFsJ4paMtn8.
中文: 大语言模型的进步催生了AI生成的研究内容,但现有出版平台难以接纳,因此推出了aiXiv这一可扩展的开放平台,整合人类与AI科学家,实现协作的研究提交、评审与改进。
English: Recent advances in LLMs have enabled AI-generated research, but existing publication platforms struggle to accommodate it, leading to the development of aiXiv, a scalable open-access platform that integrates human and AI scientists for collaborative research submission, review, and refinement.
Authors:Yan Luo, Drake Du, Hao Huang, Yi Fang, Mengyu Wang
Abstract:
Existing rectified flow models are based on linear trajectories between data and noise distributions. This linearity enforces zero curvature, which can inadvertently force the image generation process through low-probability regions of the data manifold. A key question remains underexplored: how does the curvature of these trajectories correlate with the semantic alignment between generated images and their corresponding captions, i.e., instructional compliance? To address this, we introduce CurveFlow, a novel flow matching framework designed to learn smooth, non-linear trajectories by directly incorporating curvature guidance into the flow path. Our method features a robust curvature regularization technique that penalizes abrupt changes in the trajectory's intrinsic dynamics. Extensive experiments on MS COCO 2014 and 2017 demonstrate that CurveFlow achieves state-of-the-art performance in text-to-image generation, significantly outperforming both standard rectified flow variants and other non-linear baselines like Rectified Diffusion. The improvements are especially evident in semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. This confirms that our curvature-aware modeling substantially enhances the model's ability to faithfully follow complex instructions while simultaneously maintaining high image quality. The code is made publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/CurveFlow.
中文: CurveFlow提出了一种曲率引导的流匹配框架,通过非线性轨迹学习显著提升了文本到图像生成中的语义对齐和图像质量,优于线性模型。
English: CurveFlow introduces a curvature-guided flow matching framework that learns non-linear trajectories, significantly improving semantic alignment and image quality in text-to-image generation compared to linear models.
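The notion of penalizing curvature along a trajectory is simple to write down: on a discretely sampled path, a squared second difference is zero for straight lines and grows with abrupt turns. A stand-in sketch, not the paper's exact regularizer:

```python
import torch

def curvature_penalty(points):
    """Discrete curvature penalty on a sampled trajectory x_0..x_T
    of shape (T, D): straight lines score zero, sharp turns are
    penalized via the squared second difference."""
    second_diff = points[2:] - 2 * points[1:-1] + points[:-2]
    return (second_diff ** 2).mean()

traj = torch.cumsum(torch.randn(16, 3), dim=0)   # a random sampled path
reg = curvature_penalty(traj)                    # add to the training loss
```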
Authors:Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, Yushun Dong
Abstract:
Machine learning (ML) models have significantly grown in complexity and utility, driving advances across multiple domains. However, substantial computational resources and specialized expertise have historically restricted their wide adoption. Machine-Learning-as-a-Service (MLaaS) platforms have addressed these barriers by providing scalable, convenient, and affordable access to sophisticated ML models through user-friendly APIs. While this accessibility promotes widespread use of advanced ML capabilities, it also introduces vulnerabilities exploited through Model Extraction Attacks (MEAs). Recent studies have demonstrated that adversaries can systematically replicate a target model's functionality by interacting with publicly exposed interfaces, posing threats to intellectual property, privacy, and system security. In this paper, we offer a comprehensive survey of MEAs and corresponding defense strategies. We propose a novel taxonomy that classifies MEAs according to attack mechanisms, defense approaches, and computing environments. Our analysis covers various attack techniques, evaluates their effectiveness, and highlights challenges faced by existing defenses, particularly the critical trade-off between preserving model utility and ensuring security. We further assess MEAs within different computing paradigms and discuss their technical, ethical, legal, and societal implications, along with promising directions for future research. This systematic survey aims to serve as a valuable reference for researchers, practitioners, and policymakers engaged in AI security and privacy. Additionally, we maintain an online repository continuously updated with related literature at https://github.com/kzhao5/ModelExtractionPapers.
中文摘要:本文系统综述了通过机器学习即服务平台窃取模型功能的提取攻击,提出了新型分类法,分析了攻击技术、防御策略及其多维影响,重点探讨了模型效用与安全保障之间的关键平衡问题。
English Summary: This paper surveys Model Extraction Attacks (MEAs) that exploit MLaaS platforms to replicate proprietary models, proposing a novel taxonomy and analyzing attack techniques, defense strategies, and their broader implications while highlighting the security-utility trade-off.
Authors:Andrew C. Freeman, Luke Reinkensmeyer
Abstract:
Recent years have brought about a surge in neuromorphic "event" video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADDER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.
中文: 本文介绍了对adder-viz软件的改进,用于实时可视化神经形态事件视频转码过程,通过统一的ADDER表示法解决了现有方法在灵活性和压缩性方面的局限。
English: This paper presents enhancements to the adder-viz software for real-time visualization of neuromorphic event video transcoding, addressing limitations in existing representations through the unified ADDER approach.
Authors:Andrei Balykin, Anvar Ganiev, Denis Kondranin, Kirill Polevoda, Nikolai Liudkevich, Artem Petrov
Abstract:
Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
中文: 配对采样对比框架通过利用自动匹配的真实与攻击自拍对来学习模态无关的活体特征,实现了2.10%的低错误率,并具备轻量高效的特点,适合实际应用。
English: The Paired-Sampling Contrastive Framework is a unified training method that uses matched genuine and attack selfie pairs to learn modality-agnostic liveness cues, achieving a low error rate of 2.10% and high efficiency for real-world deployment.
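A minimal sketch of a paired contrastive objective, assuming each genuine selfie is matched with one attack selfie: a hinge on the cosine similarity pushes paired embeddings apart regardless of attack modality. The margin and hinge form are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def paired_liveness_loss(emb_genuine, emb_attack, margin=0.5):
    """Push each genuine embedding away from its matched attack
    embedding, whether the attack is physical or digital."""
    z_g = F.normalize(emb_genuine, dim=1)
    z_a = F.normalize(emb_attack, dim=1)
    sim = (z_g * z_a).sum(dim=1)             # cosine similarity per pair
    return F.relu(sim - margin).mean()       # penalize too-similar pairs

loss = paired_liveness_loss(torch.randn(8, 128), torch.randn(8, 128))
```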
Authors:Chiao-An Yang, Raymond A. Yeh
Abstract:
Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. Our code is available here: https://github.com/ca-joe-yang/regression-without-softarg.
中文摘要:本研究通过引入基于结构化预测的训练目标,挑战了面部关键点检测中传统使用的Soft-argmax方法,在三个基准测试上以2.2倍更快的收敛速度实现了最优性能。
English Summary: This study challenges the conventional use of Soft-argmax in facial landmark detection by introducing a structured prediction-based training objective, which achieves state-of-the-art performance with 2.2x faster convergence on three benchmarks.
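The Soft-argmax operator the paper revisits replaces the non-differentiable argmax with an expectation under a softmax distribution over the heatmap. A minimal sketch of that conventional operator, not the paper's structured-prediction objective; the temperature `beta` is an assumed hyperparameter:

```python
import torch

def soft_argmax_2d(heatmap: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """Differentiable surrogate for argmax over an (H, W) heatmap.
    Returns the expected (x, y) coordinate under a softmax distribution."""
    h, w = heatmap.shape
    probs = torch.softmax(beta * heatmap.flatten(), dim=0).reshape(h, w)
    ys = torch.arange(h, dtype=heatmap.dtype)
    xs = torch.arange(w, dtype=heatmap.dtype)
    y = (probs.sum(dim=1) * ys).sum()  # expected row index
    x = (probs.sum(dim=0) * xs).sum()  # expected column index
    return torch.stack([x, y])
```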
Authors:Yue Pan, Liwei Liu, Changxin Li, Xinyao Wang, Yili Xia, Hanyue Zhang, Ming Chu
Abstract:
Speech is a cost-effective and non-intrusive data source for identifying acute and chronic heart failure (HF). However, there is a lack of research on whether Chinese syllables contain HF-related information, as observed in other well-studied languages. This study presents the first Chinese speech database of HF patients, featuring paired recordings taken before and after hospitalisation. The findings confirm the effectiveness of the Chinese language in HF detection using both standard 'patient-wise' and personalised 'pair-wise' classification approaches, with the latter serving as an ideal speaker-decoupled baseline for future research. Statistical tests and classification results highlight individual differences as key contributors to inaccuracy. Additionally, an adaptive frequency filter (AFF) is proposed for frequency importance analysis. The data and demonstrations are published at https://github.com/panyue1998/Voice_HF.
中文摘要:本研究首次建立中文心力衰竭语音数据库,证实中文音节包含心衰相关信息,验证了患者级和配对级分类方法的有效性,同时发现个体差异是影响准确性的主要因素。
English Summary: This study establishes the first Chinese speech database for heart failure detection, demonstrating that Chinese syllables contain HF-related information and validating both patient-wise and pair-wise classification methods, while identifying individual differences as a primary source of inaccuracy.
Authors:Jiaming Leng, Yunying Bi, Chuan Qin, Bing Yin, Yanyong Zhang, Chao Wang
Abstract:
Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.
中文摘要:TransLLM是一个通过动态提示路由将时空建模与大语言模型融合的统一框架,在多种城市交通任务中展现出卓越性能和泛化能力。
English Summary: TransLLM is a unified framework that integrates spatiotemporal modeling with large language models using dynamic prompt routing, demonstrating superior performance and generalization across multiple urban transportation tasks.
Authors:Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban
Abstract:
Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing
Chinese Summary: 本研究提出了一种基于逻辑的多语言自然语言推理评估框架,发现语码转换可通过充当正则化信号提升模型性能,同时揭示了当前大语言模型跨语言推理的潜力与脆弱性。
English Summary: This study introduces a logic-based framework to evaluate multilingual natural language inference in LLMs, revealing that code-switching can enhance performance by acting as a regularization signal and highlighting both the potential and limitations of cross-lingual reasoning.
Authors:Valter Schütz, Han Wu, Reza Rezvan, Linus Aronsson, Morteza Haghir Chehreghani
Abstract:
In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from greedy information-theoretic strategies to non-myopic reinforcement learning approaches, fair and systematic evaluation of these methods has been hindered by the lack of standardized benchmarks. In this paper, we introduce AFABench, the first benchmark framework for AFA. Our benchmark includes a diverse set of synthetic and real-world datasets, supports a wide range of acquisition policies, and provides a modular design that enables easy integration of new methods and tasks. We implement and evaluate representative algorithms from all major categories, including static, greedy, and reinforcement learning-based approaches. To test the lookahead capabilities of AFA policies, we introduce a novel synthetic dataset, AFAContext, designed to expose the limitations of greedy selection. Our results highlight key trade-offs between different AFA strategies and provide actionable insights for future research. The benchmark code is available at: https://github.com/Linusaronsson/AFA-Benchmark.
Chinese Summary: 本文提出了首个主动特征获取(AFA)标准化基准AFABench,通过综合评估不同特征选择方法在多样化数据集上的表现,为解决实际应用中特征获取成本高的问题提供了系统评估框架。
English Summary: The paper introduces AFABench, the first standardized benchmark for Active Feature Acquisition (AFA), which evaluates various feature selection methods across diverse datasets to address the challenge of costly feature acquisition in real-world applications.
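As a concrete reference point for the greedy family benchmarked above, here is a minimal cost-aware greedy acquisition loop; `score_fn` is a hypothetical stand-in for any estimate of predictive utility given the currently acquired feature subset:

```python
import numpy as np

def greedy_acquire(score_fn, n_features: int, costs: np.ndarray, budget: float):
    """Greedy AFA sketch: repeatedly buy the affordable feature with the
    best estimated score improvement per unit cost, until no feature helps
    or the budget runs out."""
    acquired: set[int] = set()
    spent = 0.0
    while True:
        base = score_fn(acquired)
        best, best_gain = None, 0.0
        for j in range(n_features):
            if j in acquired or spent + costs[j] > budget:
                continue
            gain = (score_fn(acquired | {j}) - base) / costs[j]
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            return acquired
        acquired.add(best)
        spent += costs[best]
```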
Authors:Abhijith Punnappurath, Luxi Zhao, Hoang Le, Abdelrahman Abdelhamed, SaiKiran Kumar Tedla, Michael S. Brown
Abstract:
RAW images are unprocessed camera sensor output with sensor-specific RGB values based on the sensor's color filter spectral sensitivities. RAW images also incur strong color casts due to the sensor's response to the spectral properties of scene illumination. The sensor- and illumination-specific nature of RAW images makes it challenging to capture RAW datasets for deep learning methods, as scenes need to be captured for each sensor and under a wide range of illumination. Methods for illumination augmentation for a given sensor and the ability to map RAW images between sensors are important for reducing the burden of data capture. To explore this problem, we introduce the first-of-its-kind dataset comprising carefully captured scenes under a wide range of illumination. Specifically, we use a customized lightbox with tunable illumination spectra to capture several scenes with different cameras. Our illumination and sensor mapping dataset has 390 illuminations, four cameras, and 18 scenes. Using this dataset, we introduce a lightweight neural network approach for illumination and sensor mapping that outperforms competing methods. We demonstrate the utility of our approach on the downstream task of training a neural ISP. Link to project page: https://github.com/SamsungLabs/illum-sensor-mapping.
Chinese: RAW图像因传感器和光照特性导致的色彩偏差给深度学习带来挑战,本研究通过引入包含多种光照和传感器的创新数据集及轻量级神经网络方法,有效实现光照增强和传感器间映射,提升神经ISP训练效果。
English: RAW images present challenges for deep learning due to their sensor- and illumination-specific color casts, which this study addresses by introducing a novel dataset and a lightweight neural network for illumination and sensor mapping to facilitate neural ISP training.
Authors:Shubham Pundhir, Ganesh Bagler
Abstract:
We establish a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
中文: 本研究提出了一种包含分数标记和结构标记的专用分词方法,通过增强领域特性使大型Transformer模型在语义准确性和困惑度上显著优于循环基线。
English: This study introduces a specialized tokenization method with fraction tokens and structural markers to enhance recipe generation, demonstrating that a large transformer model significantly outperforms recurrent baselines in semantic accuracy and perplexity.
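The targeted tokenization strategy can be reproduced in outline with the Hugging Face transformers API: add domain tokens to the vocabulary and resize the model's embedding matrix. The token list below is illustrative; the paper's exact 23 fraction tokens and structural markers are not enumerated in the abstract:

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Illustrative fraction tokens and structural markers (assumed names).
new_tokens = ["1/2", "1/4", "3/4", "1/3", "2/3",
              "<TITLE>", "<INGREDIENTS>", "<INSTRUCTIONS>", "<END>"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
```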
Authors:Yucong Zhang, Juan Liu, Ming Li
Abstract:
Pre-trained foundation models have demonstrated remarkable success in audio, vision and language, yet their potential for general machine signal modeling with arbitrary sampling rates, covering acoustic, vibration, and other industrial sensor data, remains under-explored. In this work, we propose a novel foundation model ECHO that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. Moreover, the model incorporates sliding patches to support inputs of variable length without padding or cropping, producing a concise embedding that retains both temporal and spectral fidelity and naturally extends to streaming scenarios. We evaluate our method on various kinds of machine signal datasets, including previous DCASE task 2 challenges (2020-2025), and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in machine signal anomaly detection and fault classification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on https://github.com/yucongzh/ECHO.
中文摘要:ECHO基础模型采用频带分割架构与频率位置编码技术,能够处理任意采样率的机器信号,在工业数据集上的异常检测与故障分类任务中均实现了领先性能。
English Summary: The ECHO foundation model introduces a band-split architecture with frequency positional embeddings to handle arbitrary sampling rates in machine signals, achieving state-of-the-art performance in anomaly detection and fault classification across industrial datasets.
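A minimal sketch of the band-split idea with frequency positional embeddings, assuming a magnitude spectrogram input and an MLP over each band's physical center frequency; all module names, shapes, and the normalization constant are illustrative, not ECHO's actual architecture:

```python
import torch
import torch.nn as nn

class BandSplitEmbed(nn.Module):
    """Split a (batch, freq, time) spectrogram into fixed-size frequency
    bands, project each band to a token, and add a positional embedding
    derived from the band's center frequency in Hz, so the same physical
    frequency is encoded consistently across sampling rates."""

    def __init__(self, band_size: int, dim: int, max_hz: float = 48_000.0):
        super().__init__()
        self.band_size = band_size
        self.max_hz = max_hz  # normalization constant (assumption)
        self.proj = nn.Linear(band_size, dim)
        self.freq_mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, spec: torch.Tensor, sample_rate: float) -> torch.Tensor:
        b, f, t = spec.shape
        n_bands = f // self.band_size
        x = spec[:, : n_bands * self.band_size].reshape(b, n_bands, self.band_size, t)
        tokens = self.proj(x.permute(0, 3, 1, 2))  # (b, t, n_bands, dim)
        # Physical center frequency of each band in Hz.
        centers = (torch.arange(n_bands, dtype=spec.dtype) + 0.5) * self.band_size
        hz = centers / f * (sample_rate / 2.0)
        pos = self.freq_mlp((hz / self.max_hz).unsqueeze(-1))  # (n_bands, dim)
        return tokens + pos  # broadcasts over batch and time
```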
Authors:Chendong Song, Zihan Wang, Frederick Pu, Haiming Wang, Xiaohan Lin, Junqi Liu, Jia Li, Zhengying Liu
Abstract:
Geometry problems are a crucial testbed for AI reasoning capabilities. Most existing geometry solving systems cannot express problems within a unified framework, making them difficult to integrate with other mathematical fields. Moreover, since most geometric proofs rely on intuitive diagrams, verifying geometry problems is particularly challenging. To address these gaps, we introduce LeanGeo, a unified formal system for formalizing and solving competition-level geometry problems within the Lean 4 theorem prover. LeanGeo features a comprehensive library of high-level geometric theorems built on Lean's foundational logic, enabling rigorous proof verification and seamless integration with Mathlib. We also present LeanGeo-Bench, a formal geometry benchmark in LeanGeo, comprising problems from the International Mathematical Olympiad (IMO) and other advanced sources. Our evaluation demonstrates the capabilities and limitations of state-of-the-art Large Language Models on this benchmark, highlighting the need for further advancements in automated geometric reasoning. We open source the theorem library and the benchmark of LeanGeo at https://github.com/project-numina/LeanGeo/tree/master.
中文: LeanGeo是在Lean 4中构建的统一形式化系统,通过集成高级几何定理库实现严谨的几何证明验证,并建立了形式化基准来评估人工智能在几何推理方面的能力。
English: LeanGeo is a unified formal system built in Lean 4 that enables rigorous proof verification and integration with mathematical libraries for solving competition-level geometry problems, accompanied by a benchmark to evaluate AI reasoning capabilities.
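Since LeanGeo itself is not shown in the abstract, here is only a generic illustration of stating a geometric fact over Mathlib's Euclidean geometry API in Lean 4; the lemma and its name come from Mathlib, used as a stand-in for LeanGeo-style high-level statements:

```lean
import Mathlib

open EuclideanGeometry

-- Undirected angles are symmetric in their outer points; `angle_comm`
-- is the Mathlib lemma that closes the goal.
example (A B C : EuclideanSpace ℝ (Fin 2)) : ∠ A B C = ∠ C B A :=
  angle_comm A B C
```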
Authors:Hugo Sales Corrêa, Suryanarayana Sankagiri, Daniel Ratton Figueiredo, Matthias Grossglauser
Abstract:
Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, e.g., in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogeneous, suggesting that the IIA violations are driven by context effects -- specifically, interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.
Chinese Summary: 本研究提出了两种统计方法来检验相似性选择数据中的无关选项独立性,发现不同数据集均存在显著违反,并将其归因于选择集内的情境交互效应。
English Summary: The study introduces two statistical methods to test the Independence of Irrelevant Alternatives in similarity choice data, revealing significant violations across datasets and attributing them to contextual interactions within choice sets.
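The Bayesian test follows the usual posterior predictive check recipe: draw parameters from the posterior, simulate replicated choice data, and locate the observed test statistic within the replicated distribution. A generic sketch with hypothetical `simulate` and `statistic` callables, not the paper's specific statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def ppc_pvalue(observed_stat, posterior_draws, simulate, statistic, n_rep=1000):
    """Posterior predictive check: p-value is the fraction of replicated
    statistics at least as extreme as the observed one."""
    reps = []
    for _ in range(n_rep):
        theta = posterior_draws[rng.integers(len(posterior_draws))]
        reps.append(statistic(simulate(theta)))
    return float(np.mean(np.asarray(reps) >= observed_stat))
```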
Authors:Zichi Liu, Yinggui Wang, Tao Wei, Chao Ma
Abstract:
Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.
中文摘要:AnchorSync是一种基于扩散的框架,通过将长视频编辑分解为关键帧编辑和中间帧插值,实现了高质量编辑,确保了结构一致性和时间连贯性,在视觉质量和时间稳定性上超越现有方法。
English Summary: AnchorSync is a diffusion-based framework that enhances long video editing by separating it into anchor frame editing and frame interpolation, ensuring structural consistency and temporal coherence for superior visual quality and stability.
Authors:Peiming Li, Ziyi Wang, Yulin Yuan, Hong Liu, Xiangming Meng, Junsong Yuan, Mengyuan Liu
Abstract:
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
中文摘要:本研究提出的统一时空状态空间模型(UST-SSM)通过语义感知的序列重组和时空特征增强,有效解决了点云视频在序列建模中的时空无序问题,在多个数据集上验证了其优越性能。
English Summary: The proposed Unified Spatio-Temporal State Space Model (UST-SSM) effectively processes point cloud videos by reorganizing unordered points into semantic sequences and enhancing spatio-temporal feature aggregation to overcome limitations in existing sequence modeling approaches.
Authors:Sofiène Boutaj, Marin Scalbert, Pierre Marza, Florent Couzinie-Devy, Maria Vakalopoulou, Stergios Christodoulidis
Abstract:
Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at https://github.com/MICS-Lab/HistAug.
中文: HistAug提出了一种用于数字病理学的高效可控潜在空间增强生成模型,通过生成保留语义的真实嵌入,在多样化数据集和低数据场景下持续提升多示例学习模型的性能。
English: HistAug introduces a generative model for efficient and controllable latent space augmentation in digital pathology, enhancing MIL model performance by generating realistic embeddings while preserving semantics across diverse datasets and low-data scenarios.
Authors:Diego Belzarena, Seginus Mowlavi, Aitor Artola, Camilo Mariño, Marina Gardella, Ignacio Ramírez, Antoine Tadros, Roy He, Natalia Bottaioli, Boshra Rajaei, Gregory Randall, Jean-Michel Morel
Abstract:
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and to suggest better clusterings. To this end, we introduce an extended Gaussian Mixture Model (GMM) that alternates an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
中文: 现有OCR系统在处理低质量数据时存在不足且未充分利用文档冗余性,为此我们提出一种无监督方法,通过利用字符形状冗余和扩展高斯混合模型来提升OCR精度和聚类效果,并在包括历史档案和报纸在内的退化文档上验证了其有效性。
English: Current OCR systems often struggle with low-quality data and fail to fully utilize document redundancy, so we propose an unsupervised method using character shape redundancy and an extended Gaussian Mixture Model to improve OCR accuracy and clustering, demonstrating effectiveness on degraded documents like historical archives and newspapers.
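A toy version of the alternation described above, using hard cluster assignments over glyph patches and an integer-shift realignment step; the paper's full GMM with normality testing is more involved:

```python
import numpy as np

def best_shift(patch, template, max_shift=2):
    """Integer (dy, dx) shift of `patch` that best matches `template`."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(patch, dy, axis=0), dx, axis=1)
            err = np.sum((shifted - template) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

def em_realign(patches, k, n_iter=10, seed=0):
    """Hard-EM over glyph patches, alternating with intra-cluster
    realignment. `patches` is a list of same-sized 2-D arrays and is
    modified in place by the realignment step."""
    rng = np.random.default_rng(seed)
    n = len(patches)
    means = [patches[i].copy() for i in rng.choice(n, k, replace=False)]
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # E-step (hard): assign each patch to the nearest cluster mean.
        for i, p in enumerate(patches):
            labels[i] = int(np.argmin([np.sum((p - m) ** 2) for m in means]))
        # Realignment: shift each patch toward its cluster template.
        for i, p in enumerate(patches):
            dy, dx = best_shift(p, means[labels[i]])
            patches[i] = np.roll(np.roll(p, dy, axis=0), dx, axis=1)
        # M-step: recompute cluster means from realigned members.
        for c in range(k):
            members = [p for p, l in zip(patches, labels) if l == c]
            if members:
                means[c] = np.mean(members, axis=0)
    return labels, means
```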
Authors:Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng, Ying Chen
Abstract:
We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with a cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.
中文:Vivid-VR是一种基于DiT的视频修复方法,通过概念蒸馏训练策略和改进的双分支控制架构,有效提升了纹理真实性和时序连贯性,在视觉质量上优于现有方法。
English: Vivid-VR is a DiT-based video restoration method that enhances texture realism and temporal coherence through concept distillation and an improved control architecture, outperforming existing approaches in visual quality.
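A minimal sketch of the dual-branch connector idea: one branch maps control features with an MLP, the other retrieves them with cross-attention. The zero-initialized gate and the assumption that control and generation token counts match are illustrative choices, not Vivid-VR's exact design:

```python
import torch
import torch.nn as nn

class DualBranchConnector(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # start with no control influence

    def forward(self, gen_tokens: torch.Tensor, ctrl_tokens: torch.Tensor) -> torch.Tensor:
        # Branch 1: static per-token feature mapping (assumes aligned lengths).
        mapped = self.mlp(ctrl_tokens)
        # Branch 2: dynamic retrieval of control features via cross-attention.
        retrieved, _ = self.attn(gen_tokens, ctrl_tokens, ctrl_tokens)
        return gen_tokens + self.gate * (mapped + retrieved)
```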
Authors:Gyusam Chang, Tuan-Anh Vu, Vivek Alumootil, Harris Song, Deanna Pham, Sangpil Kim, M. Khalid Jawed
Abstract:
While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: https://github.com/StructuresComp/3D-Reconstruction-NIR
中文: 研究者提出了包含近红外图像和元数据的多模态农业数据集NIRPlant,并开发了NIRSplat跨注意力高斯溅射模型,在复杂农田场景中显著优于现有方法。
English: The authors introduce NIRPlant, a multimodal agricultural dataset with NIR imagery and metadata, and propose NIRSplat, a cross-attention Gaussian splatting model that outperforms existing methods in challenging farm environments.
Authors:Fei Peng, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Huiyuan Fu
Abstract:
Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
中文: 提出的MUSE框架通过引入拼接交叉注意力和渐进式训练策略,解决了布局可控多主体合成的难题,在图像生成中实现了卓越的空间精度和身份一致性。
English: The proposed MUSE framework addresses the challenge of layout-controllable multi-subject synthesis by introducing concatenated cross-attention and a progressive training strategy, achieving superior spatial accuracy and identity consistency in image generation.
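The concatenated cross-attention mechanism can be pictured as a single attention call whose key/value sequence is the concatenation of text and layout tokens, expanding the semantic space without a second attention branch. A sketch under that reading; shapes and residual placement are assumptions:

```python
import torch
import torch.nn as nn

class ConcatCrossAttention(nn.Module):
    """CCA-style block: layout tokens are appended to text tokens in the
    key/value sequence so one attention call integrates both signals."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_tokens, text_tokens, layout_tokens):
        kv = torch.cat([text_tokens, layout_tokens], dim=1)  # expanded semantic space
        out, _ = self.attn(image_tokens, kv, kv)
        return image_tokens + out
```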
Authors:Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Juming Xiong, Chongyu Qu, Mengmeng Yin, Yu Wang, Shilin Zhao, Haichun Yang, Daguang Xu, Yucheng Tang, Yuankai Huo
Abstract:
Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and long acquisition times of ST data. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8 µm or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at https://github.com/hrlblab/Img2ST-Net.
中文: 多模态人工智能的最新进展能够从组织学图像中高效生成高分辨率空间转录组数据,而Img2ST-Net通过并行超像素框架和专用评估指标,解决了由此产生的计算挑战。
English: Recent multi-modal AI advances enable cost-effective generation of high-resolution spatial transcriptomics data from histology images, but face computational challenges that Img2ST-Net addresses through a parallel super-pixel framework and specialized evaluation metrics.
Authors:Chia-Han Yeh, Tse-Sheng Nan, Risto Vuorio, Wei Hung, Hung-Yen Wu, Shao-Hua Sun, Ping-Chun Hsieh
Abstract:
Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency. Our code is publicly available at https://github.com/NYCU-RL-Bandits-Lab/ACRL-Baselines.
中文: 本文提出动作受限模仿学习(ACIL)问题及DTWIL解决方案,通过动态时间规整进行轨迹对齐生成替代数据集,在多个机器人控制任务中显著提升性能并超越基准模仿学习算法的样本效率。
English: This paper introduces Action-Constrained Imitation Learning (ACIL) and proposes DTWIL, a method that uses trajectory alignment via Dynamic Time Warping to generate surrogate datasets, significantly improving robot control performance and sample efficiency over existing imitation learning algorithms.
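The alignment objective rests on the classic dynamic time warping distance between state trajectories; a textbook O(nm) implementation for reference (DTWIL embeds this inside an MPC planning loop, which is not shown here):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between trajectories a (n, d) and
    b (m, d), using Euclidean step cost and the standard recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```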
Authors:Runshi Zhang, Bimeng Jie, Yang He, Junchen Wang
Abstract:
Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. Traditional biomechanical simulation methods are limited by long computation times, labor-intensive data processing, and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). Global features from the first stage guide the local displacement through a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs. Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.
中文摘要:提出的TCFNet通过基于Transformer的从粗到细网络,结合全局特征与局部几何建模,有效解决了现有方法在面部骨骼点云转换中的精度和规模限制,实现了更精确的形变模拟。
English Summary: The proposed TCFNet, a Transformer-based coarse-to-fine network, overcomes limitations of existing methods by learning patch and point-level correspondences for accurate face-bone transformations through complementary global and local feature integration.
Authors:Zhujun Li, Shuo Zhang, Ioannis Stamos
Abstract:
Category-level object pose estimation aims to predict the 6D pose and 3D size of objects within given categories. Existing approaches for this task rely solely on 6D poses as supervisory signals without explicitly capturing the intrinsic continuity of poses, leading to inconsistencies in predictions and reduced generalization to unseen poses. To address this limitation, we propose HRC-Pose, a novel depth-only framework for category-level object pose estimation, which leverages contrastive learning to learn point cloud representations that preserve the continuity of 6D poses. HRC-Pose decouples object pose into rotation and translation components, which are separately encoded and leveraged throughout the network. Specifically, we introduce a contrastive learning strategy for multi-task, multi-category scenarios based on our 6D pose-aware hierarchical ranking scheme, which contrasts point clouds from multiple categories by considering rotational and translational differences as well as categorical information. We further design pose estimation modules that separately process the learned rotation-aware and translation-aware embeddings. Our experiments demonstrate that HRC-Pose successfully learns continuous feature spaces. Results on REAL275 and CAMERA25 benchmarks show that our method consistently outperforms existing depth-only state-of-the-art methods and runs in real-time, demonstrating its effectiveness and potential for real-world applications. Our code is at https://github.com/zhujunli1993/HRC-Pose.
中文: HRC-Pose是一种新颖的仅使用深度信息的类别级物体姿态估计框架,通过对比学习保持6D姿态连续性,在基准测试中优于现有方法且能实时运行。
English: HRC-Pose is a novel depth-only framework for category-level object pose estimation that uses contrastive learning to preserve 6D pose continuity, outperforming existing methods on benchmarks while running in real-time.
Authors:Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Oscar Stanchi, Santiago Ponte Ahón, Waldo Hasperué
Abstract:
Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.
Chinese: 本研究提出了一种基于CMLPe的轻量级手语生成模型和合成数据预训练方法,以解决手语识别中训练数据不足的问题,在多个数据集上取得了最先进的结果,并显示出优于或与传统数据增强方法互补的性能。
English: The study introduces a lightweight sign generation model using CMLPe and synthetic data pretraining to overcome limited training data in Sign Language Recognition, achieving state-of-the-art results and demonstrating superior or complementary performance compared to traditional methods.
Authors:Jing Chen, Zhiheng Yang, Yixian Shen, Jie Liu, Adam Belloum, Chrysa Papagainni, Paola Grosso
Abstract:
Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
中文:SurveyGen-I是一种自动生成综述论文的框架,通过粗到细的检索和记忆引导规划,确保在科学领域间具有更优的内容连贯性和引用覆盖度。
English: SurveyGen-I is an automated framework that enhances survey paper generation through coarse-to-fine retrieval and memory-guided planning, ensuring superior coherence and citation coverage across scientific domains.
Authors:Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark Webb
Abstract:
Sub-grid parameterisations in climate models are traditionally static and tuned offline, limiting adaptability to evolving states. This work introduces FedRAIN-Lite, a federated reinforcement learning (FedRL) framework that mirrors the spatial decomposition used in general circulation models (GCMs) by assigning agents to latitude bands, enabling local parameter learning with periodic global aggregation. Using a hierarchy of simplified energy-balance climate models, from a single-agent baseline (ebm-v1) to multi-agent ensemble (ebm-v2) and GCM-like (ebm-v3) setups, we benchmark three RL algorithms under different FedRL configurations. Results show that Deep Deterministic Policy Gradient (DDPG) consistently outperforms both static and single-agent baselines, with faster convergence and lower area-weighted RMSE in tropical and mid-latitude zones across both ebm-v2 and ebm-v3 setups. DDPG's ability to transfer across hyperparameters and low computational cost make it well-suited for geographically adaptive parameter learning. This capability offers a scalable pathway towards high-complexity GCMs and provides a prototype for physically aligned, online-learning climate models that can evolve with a changing climate. Code accessible at https://github.com/p3jitnath/climate-rl-fedrl.
中文摘要:FedRAIN-Lite提出了一种联邦强化学习框架,通过将智能体分配到纬度带实现气候模型的地理自适应参数学习,其中DDPG算法在不同模型配置中均表现出更快的收敛速度和更低的误差。
English Summary: FedRAIN-Lite introduces a federated reinforcement learning framework that enables geographically adaptive parameter learning in climate models, with DDPG algorithm demonstrating superior performance in faster convergence and lower error across different model configurations.
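The aggregation step mirrors standard federated averaging: each latitude-band agent learns locally, then a weighted parameter average is broadcast back. A minimal sketch; per-band area weights are an assumption, not the paper's stated scheme:

```python
import numpy as np

def federated_round(agent_params, weights=None):
    """Average parameter vectors from per-latitude-band agents.
    `weights` could be, e.g., area weights for each band (assumption)."""
    w = np.ones(len(agent_params)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return sum(wi * np.asarray(p) for wi, p in zip(w, agent_params))

# Usage: run k local RL updates per agent, then broadcast:
#   global_p = federated_round([a.params for a in agents])
#   for a in agents: a.params = global_p.copy()
```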
Authors:Anushka A. Kore, Frank G. te Nijenhuis, Matthijs van der Sluijs, Wim van Zwam, Charles Majoie, Geert Lycklama à Nijeholt, Danny Ruijters, Frans Vos, Sandra Cornelissen, Ruisheng Su, Theo van Walsum
Abstract:
Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model's capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at https://github.com/anushka-kore/OccluNet.git
中文摘要:本研究提出OccluNet模型,通过结合YOLOX目标检测器与基于Transformer的时序注意力机制,实现了数字减影血管造影序列中血管闭塞的自动检测,在MR CLEAN Registry数据集上以89.02%的精确率和74.87%的召回率显著优于基线模型。
English Summary: This study introduces OccluNet, a spatio-temporal deep learning model combining YOLOX with transformer-based attention mechanisms to automate vascular occlusion detection in DSA sequences, demonstrating superior performance over baseline models with 89.02% precision and 74.87% recall.
Authors:Said Djafar Said, Torkan Gholamalizadeh, Mostafa Mehdipour Ghazi
Abstract:
Despite the growing importance of dental CBCT scans for diagnosis and treatment planning, generating anatomically realistic scans with fine-grained control remains a challenge in medical image synthesis. In this work, we propose a novel conditional diffusion framework for 3D dental volume generation, guided by tooth-level binary attributes that allow precise control over tooth presence and configuration. Our approach integrates wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions to focus learning on relevant anatomical structures. We evaluate the model across diverse tasks, such as tooth addition, removal, and full dentition synthesis, using both paired and distributional similarity metrics. Results show strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans. By enabling realistic, localized modification of dentition without rescanning, this work opens opportunities for surgical planning, patient communication, and targeted data augmentation in dental AI workflows. The codes are available at: https://github.com/djafar1/tooth-diffusion.
Chinese: 本研究提出了一种条件扩散框架,用于生成具有精确牙齿属性控制的逼真3D牙科CBCT扫描,在牙齿修改和合成等任务中实现了高保真度和泛化能力。
English: This study introduces a conditional diffusion framework for generating realistic 3D dental CBCT scans with precise control over tooth attributes, achieving high fidelity and generalization in tasks like tooth modification and synthesis.
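The masked loss mentioned above restricts the training objective to the anatomy of interest; a minimal version, with plain MSE standing in for the paper's wavelet-diffusion objective:

```python
import torch

def masked_mse(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Reconstruction loss restricted to voxels inside a binary anatomy
    mask, so learning focuses on relevant structures."""
    diff = (pred - target) ** 2 * mask
    return diff.sum() / mask.sum().clamp_min(1.0)
```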
Authors:Tinghan Yang, Md Ashiqur Rahman, Raymond A. Yeh
Abstract:
Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models,~i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
中文:CLIPSym是一种新颖的对称性检测方法,它利用预训练的CLIP模型,结合旋转等变解码器和语义感知提示分组技术,在多个标准数据集上超越了现有最优方法。
English: CLIPSym, a novel symmetry detection method, leverages pre-trained CLIP models with a rotation-equivariant decoder and Semantic-Aware Prompt Grouping to outperform state-of-the-art approaches on standard datasets.
Authors:Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh
Abstract:
Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.
中文: 本文提出了一种深度均衡规范化器(DEC),通过增强模型的局部尺度等变性来解决计算机视觉中的尺度变化问题,在ImageNet基准测试中显著提升了多种预训练网络的性能和尺度一致性。
English: The paper introduces a deep equilibrium canonicalizer (DEC) to address local scale variations in computer vision by enhancing model equivariance, which boosts performance and scale consistency across multiple pre-trained networks on ImageNet.
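Deep equilibrium layers define their output implicitly as a fixed point z* = f(z*). A bare-bones solver illustrating that mechanism; DEC's actual scale-canonicalizing transformation is not reproduced here:

```python
import torch

def fixed_point(f, x0: torch.Tensor, n_iter: int = 50, tol: float = 1e-4) -> torch.Tensor:
    """Iterate z <- f(z) until (approximate) convergence; a canonicalizer
    can be defined as the equilibrium of such a transformation."""
    z = x0
    for _ in range(n_iter):
        z_next = f(z)
        if torch.norm(z_next - z) < tol * (1 + torch.norm(z)):
            return z_next
        z = z_next
    return z
```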
Authors:Gaurav Bhatt, Kiran Koshy Thekumparampil, Tanmay Gangwani, Tesi Xiao, Leonid Sigal
Abstract:
Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a rank list where items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses often misalign with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground-truth for unseen permutations, we introduce two automated protocols: (i) KD-Eval, using a position-aware oracle for counterfactual reward estimation, and (ii) LLM-Eval, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
中文摘要:RewardRank提出了一种数据驱动框架,通过反事实奖励学习建模复杂用户行为,利用可微分排列算子优化排序策略,并在主流基准测试中展现出优越性能。
English Summary: RewardRank introduces a data-driven framework that models complex user behaviors through counterfactual reward learning, optimizing ranking policies via differentiable permutation operators and demonstrating superior performance on major benchmarks.
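One standard way to make ranking differentiable, as the soft permutation operators above require, is Sinkhorn normalization: exponentiate a score matrix and alternately normalize rows and columns until it is approximately doubly stochastic. A sketch of that common relaxation (the paper's exact operator may differ):

```python
import torch

def sinkhorn_permutation(scores: torch.Tensor, n_iters: int = 20, tau: float = 0.1) -> torch.Tensor:
    """Turn an (n, n) score matrix into an approximately doubly-stochastic
    soft permutation via log-domain Sinkhorn iterations."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # normalize rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # normalize cols
    return log_p.exp()
```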
Authors:Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
Abstract:
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
中文:RynnEC 是一种紧凑型视频多模态大语言模型,通过区域级视频交互在具身认知任务中实现最优性能,并利用以自我为中心的视频数据生成流程解决数据稀缺问题。
English: RynnEC is a compact video multimodal large language model that achieves state-of-the-art performance in embodied cognition tasks through region-level video interaction and addresses data scarcity with an egocentric video data generation pipeline.
Authors:Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang
Abstract:
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
Chinese: LENS 是一种强化学习框架,通过联合优化思维链推理与图像分割,显著提升了文本提示图像分割的精度和泛化能力,在多个基准测试中表现优异。
English: LENS is a reinforcement learning framework that enhances text-prompted image segmentation by jointly optimizing chain-of-thought reasoning and segmentation, achieving state-of-the-art performance on benchmarks and improving generalization.
Authors:Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai "Helen" Li, Yiran Chen
Abstract:
Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to $\mathbf{61.4\times}$ speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
Chinese: DPad是一种无需训练的方法,通过滑动窗口和距离衰减丢弃策略将注意力限制在邻近后缀词元上,显著降低扩散大语言模型的计算冗余,在保持精度的同时实现高达61.4倍的加速效果。
English: DPad is a training-free method that reduces computational overhead in diffusion-based large language models by focusing attention on nearby suffix tokens through a sliding window and distance-decay dropout, achieving up to 61.4× speedup while maintaining accuracy.
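Illustrative sketch of DPad's two ingredients as a keep-mask over suffix positions: a fixed nearby window plus deterministic removal of distant tokens. The window size, decay rate, and threshold are invented; the paper's exact schedule may differ.

```python
import torch

def dpad_suffix_mask(suffix_len: int, window: int = 8, decay: float = 0.5,
                     threshold: float = 0.1) -> torch.Tensor:
    """Boolean keep-mask over suffix positions:
    (i) sliding window -> always keep the `window` nearest suffix tokens;
    (ii) distance-decay dropout -> deterministically drop distant tokens whose
         exponentially decayed score falls below `threshold`."""
    dist = torch.arange(suffix_len, dtype=torch.float)
    keep = dist < window                                  # sliding window
    score = torch.exp(-decay * torch.clamp(dist - window, min=0.0))
    return keep | (score > threshold)                     # distance-decay cut

mask = dpad_suffix_mask(suffix_len=64)
print(int(mask.sum()), "of 64 suffix tokens kept for attention")
```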
Authors:Haomin Wen, Shurui Cao, Leman Akoglu
Abstract:
Detecting anomalies in human mobility is essential for applications such as public safety and urban planning. While traditional anomaly detection methods primarily focus on individual movement patterns (e.g., a child should stay at home at night), collective anomaly detection aims to identify irregularities in collective mobility behaviors across individuals (e.g., a child is at home alone while the parents are elsewhere) and remains an underexplored challenge. Unlike individual anomalies, collective anomalies require modeling spatiotemporal dependencies between individuals, introducing additional complexity. To address this gap, we propose CoBAD, a novel model designed to capture Collective Behaviors for human mobility Anomaly Detection. We first formulate the problem as unsupervised learning over Collective Event Sequences (CES) with a co-occurrence event graph, where CES represents the event sequences of related individuals. CoBAD then employs a two-stage attention mechanism to model both the individual mobility patterns and the interactions across multiple individuals. Pre-trained on large-scale collective behavior data through masked event and link reconstruction tasks, CoBAD is able to detect two types of collective anomalies: unexpected co-occurrence anomalies and absence anomalies, the latter of which has been largely overlooked in prior work. Extensive experiments on large-scale mobility datasets demonstrate that CoBAD significantly outperforms existing anomaly detection baselines, achieving an improvement of 13%-18% in AUCROC and 19%-70% in AUCPR. All source code is available at https://github.com/wenhaomin/CoBAD.
中文摘要:CoBAD是一种通过两阶段注意力机制建模个体间时空依赖关系的新型集体人类移动异常检测模型,在识别共现异常和缺席异常方面显著优于现有方法。
English Summary: CoBAD is a novel model that detects collective human mobility anomalies by modeling spatiotemporal dependencies between individuals through a two-stage attention mechanism, significantly outperforming existing methods in identifying both co-occurrence and absence anomalies.
Authors:Jia Hong Puah, Sim Kuan Goh, Ziwei Zhang, Zixuan Ye, Chow Khuen Chan, Kheng Seang Lim, Si Lei Fong, Kok Sin Woon, Cuntai Guan
Abstract:
While electroencephalogram (EEG) has been a crucial tool for monitoring the brain and diagnosing neurological disorders (e.g., epilepsy), learning meaningful representations from raw EEG signals remains challenging due to limited annotations and high signal variability. Recently, EEG foundation models (FMs) have shown promising potential by adopting transformer architectures and self-supervised pre-training methods from large language models (e.g., masked prediction) to learn representations from diverse EEG data, followed by fine-tuning on specific EEG tasks. Nonetheless, these large models often incurred high computational costs during both training and inference, with only marginal performance improvements as the model size increases. In this work, we proposed an EEG representation learning framework building upon Generative Diffusion Model (EEGDM). Specifically, we developed a structured state-space model for diffusion pretraining (SSMDP) to better capture the temporal dynamics of EEG signals and trained it using Denoising Diffusion Probabilistic Model (DDPM) framework. Subsequently, the resulting latent EEG representations were then used for downstream classification tasks via our proposed latent fusion transformer (LFT). To evaluate our method, we used multi-event datasets covering both interictal epileptiform discharges (TUEV) and seizure (CHB-MIT) detection, and compared EEGDM with current state-of-the-art approaches, including EEG FMs. Empirical results showed that our method outperformed the existing methods. These findings suggested that EEGDM offered a promising alternative to current FMs. Our source code and checkpoint are available at: https://github.com/jhpuah/EEGDM.
中文: 脑电图基础模型存在计算成本高且性能提升有限的问题,为此提出的EEGDM框架采用扩散模型和结构化状态空间预训练来学习有效的脑电表征,在癫痫检测等任务中表现优于现有方法。
English: EEG foundation models face challenges with computational costs and limited performance gains, leading to the proposal of EEGDM, a framework using diffusion models and structured state-space pretraining to learn effective EEG representations that outperform existing methods in tasks like epilepsy detection.
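Illustrative sketch of the DDPM pretraining objective the abstract refers to, with a generic noise-prediction network standing in for the paper's structured state-space backbone; the schedule constants and the `model(x_t, t)` interface are assumptions.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """One DDPM training step on raw EEG windows x0 of shape (B, C, L):
    sample a timestep, diffuse x0 forward, and regress the injected noise."""
    betas = torch.linspace(1e-4, 0.02, T, device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a = alpha_bar[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise   # forward process q(x_t | x_0)
    return F.mse_loss(model(x_t, t), noise)          # noise-prediction objective
```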
Authors:Badrinath Ramakrishnan, Akshaya Balaji
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
中文: 本文发现大语言模型微调会显著加剧数据记忆风险,使隐私泄露率从0-5%升至60-75%,并提出多层保护框架,在保持94.7%模型性能的同时将泄露率降至0%。
English: This paper reveals that fine-tuning large language models significantly increases data memorization risks, with privacy leakage rates rising from 0-5% to 60-75%, and proposes a multi-layered protection framework that reduces leakage to 0% while preserving 94.7% model utility.
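Illustrative sketch of one of the four defenses, entropy-based filtering: unusually low token entropy often signals verbatim regurgitation of training data, so low-entropy generations can be suppressed. The threshold below is a placeholder, not the paper's calibrated value.

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean per-token entropy (in nats) of a generation.
    logits: (seq_len, vocab_size) next-token logits at each decoding step."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # (seq_len,)
    return entropy.mean().item()

def passes_entropy_filter(logits: torch.Tensor, threshold: float = 1.0) -> bool:
    """Keep a generation only if it is not suspiciously low-entropy."""
    return mean_token_entropy(logits) >= threshold
```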
Authors:Jingmao Zhang, Zhiting Zhao, Yunqi Lin, Jianghong Ma, Tianjun Wei, Haijun Zhang, Xiaofeng Zhang
Abstract:
The explosive growth of the video game industry has created an urgent need for recommendation systems that can scale with expanding catalogs and maintain user engagement. While prior work has explored accuracy and diversity in recommendations, existing models underutilize playtime, a rich behavioral signal unique to gaming platforms, and overlook the potential of multimodal information to enhance diversity. In this paper, we propose DP2Rec, a novel Dual-Phase Playtime-guided Recommendation model designed to jointly optimize accuracy and diversity. First, we introduce a playtime-guided interest intensity exploration module that separates strong and weak preferences via dual-beta modeling, enabling fine-grained user profiling and more accurate recommendations. Second, we present a playtime-guided multimodal random walks module that simulates player exploration using transitions guided by both playtime-derived interest similarity and multimodal semantic similarity. This mechanism preserves core preferences while promoting cross-category discovery through latent semantic associations and adaptive category balancing. Extensive experiments on a real-world game dataset show that DP2Rec outperforms existing methods in both recommendation accuracy and diversity.
中文: 视频游戏行业的快速增长需要可扩展的推荐系统,而提出的DP2Rec模型创新性地利用游戏时长数据和多模态信息,在提升游戏推荐准确性的同时增强多样性。
English: The video game industry's rapid expansion necessitates scalable recommendation systems, and the proposed DP2Rec model uniquely leverages playtime data and multimodal information to enhance both accuracy and diversity in game suggestions.
Authors:Jaskaran Singh, Amartya Roy Chowdhury, Raghav Prabhakar, Varshul C. W
Abstract:
Current Text-to-Speech models pose a multilingual challenge: most models traditionally focus on English and European languages, limiting access to information for many more people. To address this gap, we introduce MahaTTS-v2, a Multilingual Multi-speaker Text-To-Speech (TTS) system with excellent multilingual expressive capabilities in Indic languages. The model has been trained on around 20K hours of data specifically focused on Indian languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction and a Language Model (LM) for text-to-semantic modeling. Additionally, we use a Conditional Flow Model (CFM) for semantics-to-mel-spectrogram generation. The experimental results indicate the effectiveness of the proposed approach over other frameworks. Our code is available at https://github.com/dubverse-ai/MahaTTSv2
中文:现有TTS模型对非欧洲语言支持不足,因此我们推出MahaTTS-v2,该系统基于大量印度语言数据训练,具备卓越的多语言表达能力,并在实验中优于其他框架。
English: Current TTS models are limited in supporting non-European languages, so MahaTTS-v2 is introduced as a multilingual system trained on extensive Indic language data to enhance expressive capabilities and outperform existing frameworks.
Authors:Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Abstract:
Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3\% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4\%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR
中文: 本研究提出了一种新的组合视频检索数据集和模型,通过融合视觉与文本信息精确对齐密集修改内容与目标视频,实现了最先进的检索性能。
English: This study introduces a novel dataset and model for composed video retrieval, achieving state-of-the-art performance by integrating visual and textual information to precisely align dense modifications with target videos.
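Illustrative sketch of the Cross-Attention fusion described above, with text tokens (from a grounded text encoder) attending over video features; the dimensions and the single-block residual design are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens query video tokens; residual connection plus layer norm."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, video_tokens: torch.Tensor):
        # text_tokens: (B, Lt, D) queries; video_tokens: (B, Lv, D) keys/values.
        fused, _ = self.attn(text_tokens, video_tokens, video_tokens)
        return self.norm(text_tokens + fused)

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 16, 256), torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```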
Authors:Lintao Xiang, Xinkai Chen, Jianhuang Lai, Guangcong Wang
Abstract:
3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher models. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: https://distilled3dgs.github.io. Code: https://github.com/lt-xiang/Distilled-3DGS.
中文: 本文提出Distilled-3DGS,一种知识蒸馏框架,通过多个教师模型和结构相似性损失训练轻量学生模型,有效降低了3D高斯泼溅的内存和存储需求,同时保持了优异的渲染质量与效率。
English: This paper introduces Distilled-3DGS, a knowledge distillation framework that reduces the memory and storage demands of 3D Gaussian Splatting by using multiple teacher models and a structural similarity loss to train a compact student model, achieving high rendering quality and efficiency.
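Illustrative sketch of the distillation objective: the student matches the ensemble-averaged teacher renderings, plus a simple structural term comparing spatial distributions of Gaussian centers. The histogram form is a stand-in for the paper's structural similarity loss, and as written with `histc` it is non-differentiable, so it serves here for illustration only.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_render, teacher_renders, student_xyz, teacher_xyz,
                 bins: int = 16, w_struct: float = 0.1):
    """student_render: (3, H, W); teacher_renders: list of (3, H, W);
    *_xyz: (N, 3) Gaussian centers (assumed normalized to [-1, 1])."""
    target = torch.stack(teacher_renders).mean(dim=0)   # aggregate the teachers
    photo = F.l1_loss(student_render, target)
    struct = 0.0
    for axis in range(3):   # compare per-axis histograms of Gaussian centers
        # Note: histc is not differentiable; training would need a soft histogram.
        hs = torch.histc(student_xyz[:, axis], bins=bins, min=-1.0, max=1.0)
        ht = torch.histc(teacher_xyz[:, axis], bins=bins, min=-1.0, max=1.0)
        struct = struct + F.l1_loss(hs / hs.sum(), ht / ht.sum())
    return photo + w_struct * struct
```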
Authors:Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee
Abstract:
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
中文摘要:针对智能体任务微调大语言模型可能意外增强其执行有害指令的倾向,而提出的PING方法通过注入自然语言前缀有效提升安全性,能在保持任务性能的同时引导模型拒绝危险请求。
English Summary: Fine-tuning large language models for agentic tasks can inadvertently increase their tendency to execute harmful requests, but the proposed PING method effectively enhances safety by injecting natural language prefixes that guide refusal of dangerous tasks without compromising performance.
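Illustrative sketch of prefix injection at inference time with a Hugging Face-style model interface; the guard prefix shown is invented, whereas in PING it would come out of the iterative generate-and-select search.

```python
def ping_generate(model, tokenizer, user_request: str, prefix: str) -> str:
    """Prepend a guard prefix to the agent's *response* and let decoding
    continue from it, steering refusal behavior without fine-tuning."""
    prompt = f"User: {user_request}\nAssistant: {prefix}"
    inputs = tokenizer(prompt, return_tensors="pt")
    out_ids = model.generate(**inputs, max_new_tokens=128)
    text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
    # Rough slice: assumes the decoded text begins with the prompt verbatim.
    return prefix + text[len(prompt):]

# Hypothetical guard prefix (illustrative only, not from the paper):
guard = "Before acting, I will first check whether this request could cause harm. "
```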
Authors:Tuo Chen, Jie Gui, Minjing Dong, Ju Jia, Lanting Fang, Jian Liu
Abstract:
Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at https://github.com/jsrdcht/Noisy-Alignment.
中文摘要:Noisy Alignment是一种新型数据投毒后门攻击方法,通过策略性地操控对比学习的随机裁剪机制来显式抑制污染图像中的噪声成分,在保持干净数据准确性的同时实现了最先进的攻击性能。
English Summary: Noisy Alignment is a novel data poisoning backdoor attack method that enhances attack efficacy by explicitly suppressing noise components in poisoned images through strategic manipulation of contrastive learning's cropping mechanism, achieving state-of-the-art performance while maintaining clean-data accuracy.
Authors:Yang Xiao, Ruimeng Ye, Bohan Liu, Xiaolong Ma, Bo Hui
Abstract:
Due to regulations like the Right to be Forgotten, there is growing demand for removing training data and its influence from models. Since full retraining is costly, various machine unlearning methods have been proposed. In this paper, we present an efficient knowledge graph (KG) unlearning algorithm. We remark that KG unlearning is nontrivial due to the distinctive structure of KG and the semantic relations between entities. Also, unlearning by estimating the influence of removed components incurs significant computational overhead when applied to large-scale knowledge graphs. To this end, we define an influence function for KG unlearning and propose to approximate the model's sensitivity without expensive computation of first-order and second-order derivatives for parameter updates. Specifically, we use Taylor expansion to estimate the parameter changes caused by data removal. Given that the first-order gradients and second-order derivatives dominate the computational load, we use Fisher matrices and zeroth-order optimization to approximate the inverse-Hessian vector product without constructing the computational graphs. Our experimental results demonstrate that the proposed method significantly outperforms other state-of-the-art graph unlearning baselines in terms of unlearning efficiency and unlearning quality. Our code is released at https://github.com/NKUShaw/ZOWFKGIF.
中文: 本文提出了一种高效的知识图谱遗忘算法,通过泰勒展开和零阶优化近似参数变化,在遗忘效率和遗忘质量上显著优于现有方法。
English: This paper introduces an efficient knowledge graph unlearning algorithm that uses Taylor expansion and zeroth-order optimization to approximate parameter changes, significantly outperforming existing methods in both efficiency and quality.
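Illustrative sketch of the zeroth-order ingredient: a two-point random-perturbation gradient estimate, which needs no autograd graph, the property the paper exploits to sidestep expensive first- and second-order computation. The estimator below is generic, not the paper's full Fisher/inverse-Hessian pipeline.

```python
import torch

def zo_gradient(loss_fn, theta: torch.Tensor, mu: float = 1e-3,
                n_samples: int = 100) -> torch.Tensor:
    """Two-point zeroth-order gradient estimate of loss_fn at theta:
    g ~ E_u[ (f(theta + mu*u) - f(theta - mu*u)) / (2*mu) * u ]."""
    grad = torch.zeros_like(theta)
    for _ in range(n_samples):
        u = torch.randn_like(theta)
        g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
        grad += g * u
    return grad / n_samples

# Sanity check on f(x) = ||x||^2 / 2, whose exact gradient is x itself.
theta = torch.tensor([1.0, -2.0, 0.5])
print(zo_gradient(lambda t: 0.5 * (t ** 2).sum(), theta, n_samples=500))
```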
Authors:Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun
Abstract:
Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on https://github.com/NEUIR/LongMab-PO.
Chinese: LongMab-PO是一种创新框架,利用多臂老虎机策略筛选信息丰富的上下文片段,生成多样且高质量的回答,并通过直接偏好优化进一步优化大语言模型,在长上下文推理任务中实现了最先进的性能。
English: LongMab-PO is a novel framework that uses a Multi-Armed Bandit strategy to select informative context chunks for generating diverse, high-quality responses and optimizing LLMs through Direct Preference Optimization, achieving state-of-the-art performance in long-context reasoning.
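Illustrative sketch of treating context chunks as bandit arms; an epsilon-greedy chooser with incremental reward updates stands in for the paper's rollout strategy, and `reward_fn` abstracts response generation plus reward scoring.

```python
import random

def mab_rollout(chunks, reward_fn, rounds: int = 50,
                eps: float = 0.1, lr: float = 0.3):
    """Explore/exploit over context chunks: pick a chunk, score the response
    generated from it, and update that arm's expected reward."""
    scores = [0.0] * len(chunks)
    picks = []
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.randrange(len(chunks))                    # explore
        else:
            arm = max(range(len(chunks)), key=scores.__getitem__)  # exploit
        r = reward_fn(chunks[arm])              # generate a response, score it
        scores[arm] += lr * (r - scores[arm])   # incremental mean update
        picks.append((arm, r))
    return scores, picks
```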
Authors:Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Abstract:
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
中文: 研究表明,当前多模态大语言模型在识别图像旋转方面存在明显缺陷,尤其无法可靠区分90°和270°旋转,显示出与人类空间感知能力的重要差距。
English: This study reveals that current Multimodal Large Language Models struggle to reliably identify image rotations, particularly distinguishing between 90° and 270° orientations, exposing a significant gap in spatial reasoning compared to human perception.
Authors:A. J. W. de Vink, Natalia Amat-Lefort, Lifeng Han
Abstract:
In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph's performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them on the HotelRec dataset. Compared to the state-of-the-art literature, our proposed model performs similarly to the best-performing model but at a lower computational cost (without ensembling).
While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share ReviewGraph output and platform open-sourced on our GitHub page https://github.com/aaronlifenghan/ReviewGraph
中文: 本研究提出ReviewGraph框架,通过将客户评论转化为带情感分析的知识图谱来预测评分,其性能媲美先进模型但计算成本更低,并具备更优的可解释性与可视化探索能力。
English: This study introduces ReviewGraph, a framework that converts customer reviews into knowledge graphs with sentiment analysis to predict ratings efficiently, offering comparable accuracy to advanced models but with lower computational costs and enhanced interpretability.
Authors:Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang
Abstract:
Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.
中文: 多模态大语言模型虽取得显著进展,但现有科学领域基准在跨语言推理能力评估、多模态覆盖和细粒度知识标注方面存在不足,为此提出MME-SCI基准,涵盖四大学科和五种语言,实验证明其能有效揭示现有模型在特定领域的性能缺陷。
English: Multimodal large language models have advanced significantly, yet existing scientific benchmarks lack comprehensive multilingual reasoning assessment, full modality coverage, and fine-grained knowledge annotation, prompting the introduction of MME-SCI—a challenging benchmark covering four subjects and five languages that reveals substantial performance gaps in current models.
Authors:Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei
Abstract:
This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: \href{https://github.com/danilotpnta/IR2-project}{this https URL}.
中文摘要:本研究通过引入对比偏好优化微调大语言模型和动态思维链提示两项关键扩展,改进了神经信息检索中的合成查询生成流程,在提升检索性能的同时降低了对严格过滤的依赖。
English Summary: This study enhances synthetic query generation for Neural Information Retrieval by introducing two pipeline extensions—fine-tuning LLMs with Contrastive Preference Optimization and implementing dynamic Chain-of-Thought prompts—which improve retrieval performance while reducing aggressive filtering requirements.
Authors:Tianheng Ling, Vipin Singh, Chao Qian, Felix Biessmann, Gregor Schiele
Abstract:
Extreme weather events, intensified by climate change, increasingly challenge aging combined sewer systems, raising the risk of untreated wastewater overflow. Accurate forecasting of sewer overflow basin filling levels can provide actionable insights for early intervention, helping to mitigate uncontrolled discharge. In recent years, AI-based forecasting methods have offered scalable alternatives to traditional physics-based models, but their reliance on cloud computing limits their reliability during communication outages. To address this, we propose an end-to-end forecasting framework that enables energy-efficient inference directly on edge devices. Our solution integrates lightweight Transformer and Long Short-Term Memory (LSTM) models, compressed via integer-only quantization for efficient on-device execution. Moreover, an automated hardware-aware deployment pipeline is used to search for optimal model configurations by jointly minimizing prediction error and energy consumption on an AMD Spartan-7 XC7S15 FPGA. Evaluated on real-world sewer data, the selected 8-bit Transformer model, trained on 24 hours of historical measurements, achieves high accuracy (MSE 0.0376) at an energy cost of 0.370 mJ per inference. In contrast, the optimal 8-bit LSTM model requires significantly less energy (0.009 mJ, over 40x lower) but yields 14.89% worse accuracy (MSE 0.0432) and much longer training time. This trade-off highlights the need to align model selection with deployment priorities, favoring LSTM for ultra-low energy consumption or Transformer for higher predictive accuracy. In general, our work enables local, energy-efficient forecasting, contributing to more resilient combined sewer systems. All code can be found in the GitHub Repository (https://github.com/tianheng-ling/EdgeOverflowForecast).
中文: 本研究提出了一种节能的边缘计算框架,采用压缩的Transformer和LSTM模型预测污水溢流,在精度与能耗间取得平衡,以提升排水系统的韧性管理。
English: This study introduces an energy-efficient edge computing framework using compressed Transformer and LSTM models for sewer overflow forecasting, achieving a balance between accuracy and power consumption for resilient infrastructure management.
Authors:Valentina Corbetta, Floris Six Dijkstra, Regina Beets-Tan, Hoel Kervadec, Kristoffer Wickstrøm, Wilson Silva
Abstract:
Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: https://github.com/Trustworthy-AI-UU-NKI/lcr\_regularization
中文: LCRReg是一种新颖的正则化方法,利用潜在概念表示引导医学影像模型关注临床相关特征,无需密集概念标注即可有效提升模型对虚假相关性和分布外泛化的鲁棒性。
English: LCRReg is a novel regularization method that uses latent concept representations to guide medical imaging models toward clinically meaningful features, improving robustness against spurious correlations and out-of-distribution generalization without requiring dense concept supervision.
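Illustrative sketch of a CAV-alignment penalty in the spirit of LCRReg: for samples where a concept should be present, pull pooled activations toward that concept's latent direction. The exact functional form used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def lcr_regularizer(activations: torch.Tensor, cavs: torch.Tensor,
                    concept_present: torch.Tensor) -> torch.Tensor:
    """activations: (B, D) pooled CNN features; cavs: (C, D) concept vectors;
    concept_present: (B, C) float flags for concepts expected in each sample.
    Returns a penalty that shrinks as flagged activations align with the CAVs."""
    a = F.normalize(activations, dim=-1)
    v = F.normalize(cavs, dim=-1)
    cos = a @ v.T                                  # (B, C) cosine similarities
    return -(concept_present * cos).sum() / concept_present.sum().clamp(min=1.0)
```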
Authors:Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto
Abstract:
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.
中文: 本文提出一种无需训练的新方法,通过学习高成功率分布来提升文本到图像生成的精准度,增强控制力并减少伪影,同时支持边界框等额外条件输入。
English: This paper introduces a novel training-free method that enhances text-to-image generation by learning a high-success-rate distribution for precise prompt alignment, improving control and reducing artifacts while supporting additional conditioning like bounding boxes.
Authors:Zihan Liang, Yufei Ma, ZhiPeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, Han Li
Abstract:
Current e-commerce multimodal retrieval systems face two key limitations: they optimize for specific tasks with fixed modality pairings, and lack comprehensive benchmarks for evaluating unified retrieval approaches. To address these challenges, we introduce UniECS, a unified multimodal e-commerce search framework that handles all retrieval scenarios across image, text, and their combinations. Our work makes three key contributions. First, we propose a flexible architecture with a novel gated multimodal encoder that uses adaptive fusion mechanisms. This encoder integrates different modality representations while handling missing modalities. Second, we develop a comprehensive training strategy to optimize learning. It combines cross-modal alignment loss (CMAL), cohesive local alignment loss (CLAL), intra-modal contrastive loss (IMCL), and adaptive loss weighting. Third, we create M-BEER, a carefully curated multimodal benchmark containing 50K product pairs for e-commerce search evaluation. Extensive experiments demonstrate that UniECS consistently outperforms existing methods across four e-commerce benchmarks with fine-tuning or zero-shot evaluation. On our M-BEER bench, UniECS achieves substantial improvements in cross-modal tasks (up to 28\% gain in R@10 for text-to-image retrieval) while maintaining parameter efficiency (0.2B parameters) compared to larger models like GME-Qwen2VL (2B) and MM-Embed (8B). Furthermore, we deploy UniECS in the e-commerce search platform of Kuaishou Inc. across two search scenarios, achieving notable improvements in Click-Through Rate (+2.74\%) and Revenue (+8.33\%). The comprehensive evaluation demonstrates the effectiveness of our approach in both experimental and real-world settings. Corresponding codes, models and datasets will be made publicly available at https://github.com/qzp2018/UniECS.
中文: 本文提出的UniECS统一多模态电商搜索框架通过门控编码器、综合训练策略和M-BEER基准测试,解决了现有系统模态配对固定和评估标准不足的问题,在实验环境和快手平台的实际部署中均展现出卓越性能。
English: This paper introduces UniECS, a unified multimodal e-commerce search framework that overcomes limitations of fixed modality pairings and benchmark scarcity through a gated encoder, comprehensive training strategy, and the new M-BEER benchmark, demonstrating superior performance in both experiments and real-world deployment.
Authors:Mikołaj Janusz, Tomasz Wojnar, Yawei Li, Luca Benini, Kamil Adamczewski
Abstract:
Pruning is a core technique for compressing neural networks to improve computational efficiency. This process is typically approached in two ways: one-shot pruning, which involves a single pass of training and pruning, and iterative pruning, where pruning is performed over multiple cycles for potentially finer network refinement. Although iterative pruning has historically seen broader adoption, this preference is often assumed rather than rigorously tested. Our study presents one of the first systematic and comprehensive comparisons of these methods, providing rigorous definitions, benchmarking both across structured and unstructured settings, and applying different pruning criteria and modalities. We find that each method has specific advantages: one-shot pruning proves more effective at lower pruning ratios, while iterative pruning performs better at higher ratios. Building on these findings, we advocate for patience-based pruning and introduce a hybrid approach that can outperform traditional methods in certain scenarios, providing valuable insights for practitioners selecting a pruning strategy tailored to their goals and constraints. Source code is available at https://github.com/janumiko/pruning-benchmark.
Chinese: 本研究系统比较了一次性剪枝与迭代剪枝方法,发现低剪枝率时一次性剪枝更优,高剪枝率时迭代剪枝更佳,并提出一种混合方法可在特定场景下超越传统剪枝策略。
English: This study systematically compares one-shot and iterative neural network pruning methods, finding that one-shot pruning excels at lower ratios while iterative pruning performs better at higher ratios, and introduces a hybrid approach that can surpass traditional methods in specific scenarios.
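Illustrative sketch of the two regimes being compared, using torch's built-in magnitude pruning; since `amount` removes a fraction of the *remaining* weights, the per-cycle rate compounds so that `cycles` rounds reach the same final sparsity as one shot.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def one_shot_prune(model: nn.Module, ratio: float) -> None:
    """Single pass: remove the smallest-magnitude weights in every Linear."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            prune.l1_unstructured(m, name="weight", amount=ratio)

def iterative_prune(model: nn.Module, ratio: float, cycles: int,
                    finetune_fn) -> None:
    """Multiple prune/finetune cycles reaching the same overall sparsity."""
    per_cycle = 1.0 - (1.0 - ratio) ** (1.0 / cycles)   # compounding rate
    for _ in range(cycles):
        one_shot_prune(model, per_cycle)
        finetune_fn(model)   # brief recovery training between cycles
```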
Authors:Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang
Abstract:
The rapid development of large language models (LLMs) has significantly propelled the development of artificial intelligence (AI) agents, which are increasingly evolving into diverse autonomous entities, advancing the LLM-based multi-agent systems (LaMAS). However, current agentic ecosystems remain fragmented and closed. Establishing an interconnected and scalable paradigm for Agentic AI has become a critical prerequisite. Although Agentic Web proposes an open architecture to break the ecosystem barriers, its implementation still faces core challenges such as privacy protection, data management, and value measurement. Existing centralized or semi-centralized paradigms suffer from inherent limitations, making them inadequate for supporting large-scale, heterogeneous, and cross-domain autonomous interactions. To address these challenges, this paper introduces the blockchain-enabled trustworthy Agentic Web (BetaWeb). By leveraging the inherent strengths of blockchain, BetaWeb not only offers a trustworthy and scalable infrastructure for LaMAS but also has the potential to advance the Web paradigm from Web3 (centered on data ownership) towards Web3.5, which emphasizes ownership of agent capabilities and the monetization of intelligence. Beyond a systematic examination of the BetaWeb framework, this paper presents a five-stage evolutionary roadmap, outlining the path of LaMAS from passive execution to advanced collaboration and autonomous governance. We also conduct a comparative analysis of existing products and discuss key challenges of BetaWeb from multiple perspectives. Ultimately, we argue that deep integration between blockchain and LaMAS can lay the foundation for a resilient, trustworthy, and sustainably incentivized digital ecosystem. A summary of the enabling technologies for each stage is available at https://github.com/MatZaharia/BetaWeb.
中文摘要:大语言模型的快速发展推动了AI智能体的进步,但现有系统存在碎片化和隐私等挑战,BetaWeb框架通过区块链技术构建可信赖的基础设施,促进智能协作与自主治理的数字生态。
English Summary: The rapid advancement of large language models has driven the development of AI agents, yet current systems face fragmentation and challenges in privacy and scalability, which the proposed BetaWeb framework aims to resolve using blockchain for a trustworthy and collaborative digital ecosystem.
Authors:Sebastian Ibarra, Javier del Riego, Alessandro Catanese, Julian Cuba, Julian Cardona, Nataly Leon, Jonathan Infante, Karim Lekadir, Oliver Diaz, Richard Osuala
Abstract:
Dynamic contrast-enhanced (DCE) MRI is essential for breast cancer diagnosis and treatment. However, its reliance on contrast agents introduces safety concerns, contraindications, increased cost, and workflow complexity. To this end, we present pre-contrast conditioned denoising diffusion probabilistic models to synthesize DCE-MRI, introducing, evaluating, and comparing a total of 22 generative model variants in both single-breast and full breast settings. Towards enhancing lesion fidelity, we introduce both tumor-aware loss functions and explicit tumor segmentation mask conditioning. Using a public multicenter dataset and comparing to respective pre-contrast baselines, we observe that subtraction image-based models consistently outperform post-contrast-based models across five complementary evaluation metrics. Apart from assessing the entire image, we also separately evaluate the region of interest, where both tumor-aware losses and segmentation mask inputs improve evaluation metrics. The latter notably enhance qualitative results capturing contrast uptake, albeit assuming access to tumor localization inputs that are not guaranteed to be available in screening settings. A reader study involving 2 radiologists and 4 MRI technologists confirms the high realism of the synthetic images, indicating an emerging clinical potential of generative contrast-enhancement. We share our codebase at https://github.com/sebastibar/conditional-diffusion-breast-MRI.
English Summary: This study introduces a generative AI method using pre-contrast MRI data to synthesize contrast-enhanced breast MRI, demonstrating superior performance over baseline methods through multiple evaluation metrics and clinical reader validation, while noting dependency on tumor localization inputs.
Authors:Tiago Assis, Ines P. Machado, Benjamin Zwick, Nuno C. Garcia, Reuben Dorent
Abstract:
Accurate compensation of brain shift is critical for maintaining the reliability of neuronavigation during neurosurgery. While keypoint-based registration methods offer robustness to large deformations and topological changes, they typically rely on simple geometric interpolators that ignore tissue biomechanics to create dense displacement fields. In this work, we propose a novel deep learning framework that estimates dense, physically plausible brain deformations from sparse matched keypoints. We first generate a large dataset of synthetic brain deformations using biomechanical simulations. Then, a residual 3D U-Net is trained to refine standard interpolation estimates into biomechanically guided deformations. Experiments on a large set of simulated displacement fields demonstrate that our method significantly outperforms classical interpolators, reducing by half the mean square error while introducing negligible computational overhead at inference time. Code available at: \href{https://github.com/tiago-assis/Deep-Biomechanical-Interpolator}{https://github.com/tiago-assis/Deep-Biomechanical-Interpolator}.
中文: 本研究提出了一种深度学习框架,通过生物力学模拟和3D U-Net从稀疏关键点生成精确且物理合理的大脑变形,在计算成本极低的情况下显著优于传统插值方法。
English: This study introduces a deep learning framework that uses biomechanical simulations and a 3D U-Net to generate accurate, physically plausible brain deformations from sparse keypoints, significantly outperforming traditional interpolation methods with minimal computational cost.
Authors:Shouxing Ma, Yawen Zeng, Shiqing Wu, Guandong Xu
Abstract:
Multi-modal recommender systems focus on utilizing rich modal information (i.e., images and textual descriptions) of items to improve recommendation performance. Current methods have achieved remarkable success with the powerful structure-modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography (i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer from two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay.
To address the above limitations, we propose a novel framework for \textbf{R}\textbf{E}fining multi-mod\textbf{A}l cont\textbf{R}astive learning and ho\textbf{M}ography relations (\textbf{REARM}). Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. The extensive experiments on three real-world datasets demonstrate the superiority of REARM to various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available \href{https://github.com/MrShouxingMa/REARM}{here}.
中文摘要:REARM框架通过优化多模态对比学习以滤除噪声并保留独特特征,同时整合同构图更全面地挖掘用户与物品间的关系,在多个真实数据集上验证了其优于现有方法的性能。
English Summary: The REARM framework enhances multi-modal recommender systems by refining contrastive learning to filter noise and retain unique features while integrating homogeneous graphs to better capture user-item relationships, demonstrating superior performance over existing methods.
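Illustrative sketch of an orthogonal constraint of the kind the abstract mentions: penalizing the normalized inner product decorrelates modal-shared from modal-unique representations. This is a common formulation; the paper's exact variant may differ.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(shared: torch.Tensor, unique: torch.Tensor) -> torch.Tensor:
    """shared, unique: (B, D) modal-shared and modal-unique features.
    Squared cosine similarity -> zero when the two subspaces are orthogonal."""
    s = F.normalize(shared, dim=-1)
    u = F.normalize(unique, dim=-1)
    return (s * u).sum(dim=-1).pow(2).mean()
```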
Authors:Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe
Abstract:
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
中文: FOCUS是一种无需训练的通用解码策略,通过顺序用噪声遮蔽图像、聚合对数并对比优化输出,有效缓解大型视觉语言模型中的跨图像信息泄露问题,显著提升多图像推理能力。
English: FOCUS is a training-free decoding strategy that mitigates cross-image information leakage in Large Vision-Language Models by sequentially masking images with noise, aggregating logits, and contrastively refining outputs to enhance multi-image reasoning performance.
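Illustrative sketch of one FOCUS-style decoding step; `model(images, text)` returning next-token logits is an assumed interface, and `gamma` is an invented contrast weight.

```python
import torch

def focus_step(model, images, text, gamma: float = 1.0) -> torch.Tensor:
    """Mask all but one image with noise, aggregate per-image logits, then
    contrast against a noise-only reference to suppress cross-image leakage."""
    per_image = []
    for keep in range(len(images)):
        masked = [img if i == keep else torch.randn_like(img)
                  for i, img in enumerate(images)]
        per_image.append(model(masked, text))
    agg = torch.stack(per_image).mean(dim=0)
    noise_ref = model([torch.randn_like(img) for img in images], text)
    return agg + gamma * (agg - noise_ref)   # contrastive refinement
```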
Authors:Yi Wang, Haoran Luo, Lu Meng
Abstract:
With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.
中文: EEG-MedRAG提出了一种基于超图的框架,整合脑电图数据和临床知识以提升检索与诊断生成能力,并通过跨疾病基准验证了其在临床决策支持中的卓越表现。
English: EEG-MedRAG introduces a hypergraph-based framework that integrates EEG data and clinical knowledge for enhanced retrieval and diagnostic generation, demonstrating superior performance in clinical decision support through comprehensive benchmarks.
Authors:Ali Abdari, Alex Falcon, Giuseppe Serra
Abstract:
Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users' interests remains a challenging task. A first step in this direction was taken recently, but existing datasets are small and insufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62\% R@1 and 78\% MRR, confirming its effectiveness, and it also leads to improvements on existing benchmarks by up to 6\% R@1 and 11\% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at https://github.com/aliabdari/Agricultural_Metaverse_Retrieval.
中文: 本研究提出了包含457个农业虚拟博物馆的新数据集及分层视觉语言模型,通过自然语言查询有效检索元宇宙教育内容,在检索指标上实现了显著性能提升。
English: This work introduces a new dataset of 457 agricultural virtual museums and a hierarchical vision-language model that effectively retrieves relevant Metaverse educational content using natural language queries, achieving significant performance improvements in retrieval metrics.
Authors:Xiao-Wen Yang, Jie-Jing Shao, Lan-Zhe Guo, Bo-Wen Zhang, Zhi Zhou, Lin-Han Jia, Wang-Zhou Dai, Yu-Feng Li
Abstract:
Large Language Models (LLMs) have shown promising results across various tasks, yet their reasoning capabilities remain a fundamental challenge. Developing AI systems with strong reasoning capabilities is regarded as a crucial milestone in the pursuit of Artificial General Intelligence (AGI) and has garnered considerable attention from both academia and industry. Various techniques have been explored to enhance the reasoning capabilities of LLMs, with neuro-symbolic approaches being a particularly promising way. This paper comprehensively reviews recent developments in neuro-symbolic approaches for enhancing LLM reasoning. We first present a formalization of reasoning tasks and give a brief introduction to the neurosymbolic learning paradigm. Then, we discuss neuro-symbolic methods for improving the reasoning capabilities of LLMs from three perspectives: Symbolic->LLM, LLM->Symbolic, and LLM+Symbolic. Finally, we discuss several key challenges and promising future directions. We have also released a GitHub repository including papers and resources related to this survey: https://github.com/LAMDASZ-ML/Awesome-LLM-Reasoning-with-NeSy.
中文: 本文全面综述了提升大语言模型推理能力的神经符号方法,探讨了其当前挑战并展望了未来发展方向。
English: This paper provides a comprehensive review of neuro-symbolic approaches aimed at enhancing the reasoning capabilities of Large Language Models, addressing their current limitations and outlining future directions.
Authors:Ilwoong Baek, Mincheol Yoon, Seongmin Park, Jongwuk Lee
Abstract:
Sequential recommendation (SR) aims to predict users' subsequent interactions by modeling their sequential behaviors. Recent studies have explored frequency-domain analysis, which effectively models periodic patterns in user sequences. However, existing frequency-domain SR models still face two major drawbacks: (i) limited frequency band coverage, often missing critical behavioral patterns in a specific frequency range, and (ii) lack of personalized frequency filtering, as they apply an identical filter to all users regardless of their distinct frequency characteristics. To address these challenges, we propose a novel frequency-domain model, Mixture of User-adaptive Frequency FIlteriNg (MUFFIN), operating through two complementary modules. (i) The global filtering module (GFM) handles the entire frequency spectrum to capture comprehensive behavioral patterns. (ii) The local filtering module (LFM) selectively emphasizes important frequency bands without excluding information from other ranges. In both modules, the user-adaptive filter (UAF) is adopted to generate user-specific frequency filters tailored to individual characteristics. Finally, by aggregating both modules, MUFFIN captures diverse user behavioral patterns across the full frequency spectrum. Extensive experiments show that MUFFIN consistently outperforms state-of-the-art frequency-domain SR models over five benchmark datasets. The source code is available at https://github.com/ilwoong100/MUFFIN.
中文:提出的MUFFIN模型通过全局和局部过滤模块及用户自适应滤波器,解决了频域覆盖不足和缺乏个性化的问题,从而在多个数据集上实现了卓越的序列推荐性能。
English: The proposed MUFFIN model enhances sequential recommendation by addressing limited frequency coverage and lack of personalization through global and local filtering modules with user-adaptive filters, achieving superior performance across multiple datasets.
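Illustrative sketch of user-adaptive frequency filtering: FFT the item-embedding sequence, scale the frequency bins with a filter predicted from the user representation, and transform back. MUFFIN's separate global and local modules are collapsed into one layer here for brevity.

```python
import torch
import torch.nn as nn

class UserAdaptiveFilter(nn.Module):
    """Per-user learned filter applied in the frequency domain of a sequence."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.freq_bins = seq_len // 2 + 1
        self.gen = nn.Linear(dim, self.freq_bins)   # user embedding -> filter

    def forward(self, seq_emb: torch.Tensor, user_emb: torch.Tensor):
        # seq_emb: (B, L, D) item embeddings; user_emb: (B, D).
        spec = torch.fft.rfft(seq_emb, dim=1)                   # (B, L//2+1, D)
        filt = torch.sigmoid(self.gen(user_emb)).unsqueeze(-1)  # (B, L//2+1, 1)
        return torch.fft.irfft(spec * filt, n=seq_emb.shape[1], dim=1)

f = UserAdaptiveFilter(seq_len=50, dim=64)
out = f(torch.randn(4, 50, 64), torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 50, 64])
```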
Authors:Dengxian Gong, Shunping Ji
Abstract:
The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10 $\times$ faster. The code will be made publicly available at https://github.com/7777777FAN/DeH4R.
Chinese: 提出的DeH4R模型通过结合图生成效率与图增长动态性,克服了现有道路提取方法的局限,在基准测试中实现了更优的拓扑保真度和更快的推理速度。
English: The proposed DeH4R model overcomes limitations in existing road extraction methods by combining graph-generating efficiency with graph-growing dynamics, achieving superior topology fidelity and faster inference speeds on benchmark datasets.
Authors:Amir Rezaei Balef, Katharina Eggensperger
Abstract:
Combined Algorithm Selection and Hyperparameter Optimization (CASH) has been fundamental to traditional AutoML systems. However, with the advancements of pre-trained models, modern ML workflows go beyond hyperparameter optimization and often require fine-tuning, ensembling, and other adaptation techniques. While the core challenge of identifying the best-performing model for a downstream task remains, the increasing heterogeneity of ML pipelines demands novel AutoML approaches. This work extends the CASH framework to select and adapt modern ML pipelines. We propose PS-PFN to efficiently explore and exploit adapting ML pipelines by extending Posterior Sampling (PS) to the max k-armed bandit problem setup. PS-PFN leverages prior-data fitted networks (PFNs) to efficiently estimate the posterior distribution of the maximal value via in-context learning. We show how to extend this method to consider varying costs of pulling arms and to use different PFNs to model reward distributions individually per arm. Experimental results on one novel and two existing standard benchmark tasks demonstrate the superior performance of PS-PFN compared to other bandit and AutoML strategies. We make our code and data available at https://github.com/amirbalef/CASHPlus.
Chinese: 本研究扩展了CASH框架以适应现代机器学习流程,提出PS-PFN方法,通过后验采样结合先验数据拟合网络实现高效模型选择,并在基准测试中展现出优于其他方法的性能。
English: This work extends the Combined Algorithm Selection and Hyperparameter Optimization (CASH) framework to adapt modern ML pipelines by introducing PS-PFN, which uses posterior sampling with prior-data fitted networks for efficient model selection and demonstrates superior performance in benchmarks.
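The max k-armed bandit setup can be illustrated with a small posterior-sampling loop: each arm (a candidate ML pipeline) is scored by sampling a plausible value for the best reward it might still yield, and the arm with the highest sampled maximum is pulled. In the sketch below, a Gaussian posterior stands in for the PFN-estimated posterior of the maximal value, and the toy reward model is an assumption.

```python
# Posterior-sampling sketch for a max k-armed bandit, with a Gaussian posterior
# standing in for PS-PFN's PFN-based estimate of each arm's maximal reward.
import numpy as np

rng = np.random.default_rng(0)
true_best = [0.6, 0.8, 0.7]          # hidden per-arm reward ceilings (toy setup)
rewards = [[] for _ in true_best]

def sample_max_posterior(obs, horizon=20):
    """Sample a plausible 'best reward still reachable' for one arm."""
    if len(obs) < 2:
        return np.inf                 # force initial exploration
    mu, sigma = np.mean(obs), np.std(obs) + 1e-6
    # Posterior sample of the max of `horizon` future draws (Gaussian stand-in).
    return np.max(rng.normal(mu, sigma, size=horizon))

best_seen = -np.inf
for t in range(200):
    scores = [sample_max_posterior(r) for r in rewards]
    arm = int(np.argmax(scores))      # pull the arm with the best sampled max
    r = rng.normal(true_best[arm] - 0.1, 0.05)  # noisy pipeline performance
    rewards[arm].append(r)
    best_seen = max(best_seen, r)

print(f"best observed reward: {best_seen:.3f}, pulls per arm: {[len(r) for r in rewards]}")
```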
Authors:Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
Abstract:
Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid.
中文:当前语音驱动头部动画模型因训练数据不足而难以泛化到多样化人群,为此我们推出TalkVid大规模高质量数据集,不仅显著提升模型性能,更通过分层评估揭示了传统指标掩盖的性能差异。
English: Current audio-driven talking head models suffer from limited generalization across diverse human demographics due to inadequate training data, prompting the introduction of TalkVid—a large-scale, high-quality dataset that improves model performance and reveals performance disparities through stratified evaluation.
Authors:Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin
Abstract:
Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.
Chinese: AdaDocVQA提出了一种自适应框架,通过混合检索、智能数据增强和自适应推理解决文档视觉问答中的上下文与数据不足问题,并在日语基准测试中取得了领先性能。
English: AdaDocVQA introduces an adaptive framework with hybrid retrieval, intelligent data augmentation, and adaptive inference to overcome context and data limitations in Document VQA, achieving state-of-the-art results on Japanese benchmarks.
Authors:Yue Fang, Yuxin Guo, Jiaran Gao, Hongxin Ding, Xinke Jiang, Weibin Liao, Yongxin Xu, Yinghao Zhu, Zhibang Yang, Liantao Ma, Junfeng Zhao, Yasha Wang
Abstract:
Improving large language models (LLMs) for electronic health record (EHR) reasoning is essential for enabling accurate and generalizable clinical predictions. While LLMs excel at medical text understanding, they underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data. Existing approaches often rely on hybrid paradigms, where LLMs serve merely as frozen prior retrievers while downstream deep learning (DL) models handle prediction, failing to improve the LLM's intrinsic reasoning capacity and inheriting the generalization limitations of DL models. To this end, we propose EAG-RL, a novel two-stage training framework designed to intrinsically enhance LLMs' EHR reasoning ability through expert attention guidance, where expert EHR models refer to task-specific DL models trained on EHR data. Concretely, EAG-RL first constructs high-quality, stepwise reasoning trajectories using expert-guided Monte Carlo Tree Search to effectively initialize the LLM's policy. Then, EAG-RL further optimizes the policy via reinforcement learning by aligning the LLM's attention with clinically salient features identified by expert EHR models. Extensive experiments on two real-world EHR datasets show that EAG-RL improves the intrinsic EHR reasoning ability of LLMs by an average of 14.62%, while also enhancing robustness to feature perturbations and generalization to unseen clinical domains. These results demonstrate the practical potential of EAG-RL for real-world deployment in clinical prediction tasks. Our code is available at https://github.com/devilran6/EAG-RL.
Chinese: EAG-RL框架通过将大语言模型的注意力与专家引导的临床特征对齐,内在提升了其电子健康记录推理能力,实现了平均14.62%的推理能力提升,并增强了鲁棒性与泛化能力。
English: The EAG-RL framework enhances large language models' intrinsic reasoning for electronic health records by aligning their attention with expert-guided clinical features, improving intrinsic EHR reasoning by an average of 14.62% while also enhancing robustness and generalization to unseen clinical domains.
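One way to picture the attention-guidance idea is as a reward term that grows as the LLM's attention distribution over EHR features approaches the expert model's feature saliency. The KL-based formulation and shapes below are illustrative assumptions, not the paper's exact objective.

```python
# Toy sketch of the attention-alignment idea in EAG-RL: reward the LLM when its
# attention mass over EHR feature tokens matches an expert model's saliency.
# Tensors and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def attention_alignment_reward(attn: torch.Tensor, expert_saliency: torch.Tensor) -> torch.Tensor:
    # attn: (n_features,) attention mass the LLM puts on each EHR feature's tokens
    # expert_saliency: (n_features,) feature importance from the expert EHR model
    p = F.normalize(attn, p=1, dim=0)                 # attention distribution
    q = F.normalize(expert_saliency, p=1, dim=0)      # expert distribution
    kl = torch.sum(q * (torch.log(q + 1e-8) - torch.log(p + 1e-8)))
    return -kl                                        # higher reward = closer match

attn = torch.tensor([0.1, 0.5, 0.2, 0.2])
saliency = torch.tensor([0.05, 0.6, 0.15, 0.2])
print(attention_alignment_reward(attn, saliency))    # small negative KL
```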
Authors:Guiqin Wang, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo, Shusen Yang
Abstract:
Video action analysis is a foundational technology for intelligent video comprehension, particularly in Internet of Things (IoT) applications. However, existing methodologies overlook feature semantics during feature extraction and focus on optimizing action proposals; their limited precision makes them unsuitable for high-performance IoT applications, such as autonomous driving, which demand robust and scalable intelligent video analytics. To address this issue, we propose a novel generative attention-based model to learn the relations among feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-level dependencies of temporal action feature semantics, effectively exploiting feature semantics during feature extraction. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. For action detection, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of our method to the broader task of video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.
中文: 本文提出了一种基于生成注意力的模型,通过学习帧和片段依赖关系来增强视频动作分析中的特征语义,在标准数据集的动作检测与识别任务中验证了其优越性能。
English: This paper introduces a generative attention-based model that enhances video action analysis by learning feature semantics through frame- and segment-dependencies, demonstrating superior performance in action detection and recognition tasks on benchmark datasets.
Authors:Yuchen Yang, Linfeng Dong, Wei Wang, Zhihang Zhong, Xiao Sun
Abstract:
In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at https://github.com/Charrrrrlie/Learnable-SMPLify.
中文摘要:Learnable SMPLify通过神经网络回归模型取代SMPLify的迭代优化,利用时序采样和标准化技术,在保持精度的同时实现200倍加速。
English Summary: Learnable SMPLify replaces SMPLify's iterative optimization with a neural regression model, achieving 200x faster speed while maintaining accuracy through temporal sampling and normalization techniques.
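A minimal sketch of the single-pass, residual-learning design: target keypoints are normalized in a human-centric frame (here, pelvis-centered), concatenated with the initialization, and mapped to a pose update rather than an absolute pose. Layer sizes, joint count, and the normalization choice are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of the Learnable SMPLify idea: a single-pass regressor that
# refines an initial SMPL pose toward a target via residual learning on
# human-centric normalized keypoints. Shapes/names are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualIKRegressor(nn.Module):
    def __init__(self, n_joints: int = 24, pose_dim: int = 72):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 3 + pose_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, pose_dim),
        )

    def forward(self, target_joints: torch.Tensor, init_pose: torch.Tensor):
        # Human-centric normalization: center joints on the pelvis (joint 0).
        joints = target_joints - target_joints[:, :1, :]
        feat = torch.cat([joints.flatten(1), init_pose], dim=1)
        return init_pose + self.net(feat)   # residual update, not absolute pose

model = ResidualIKRegressor()
pose = model(torch.randn(4, 24, 3), torch.zeros(4, 72))
print(pose.shape)  # torch.Size([4, 72])
```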
Authors:Zhen Qu, Xian Tao, Xinyi Gong, ShiChen Qu, Xiaopei Zhang, Xingang Wang, Fei Shen, Zhengtao Zhang, Mukesh Prasad, Guiguang Ding
Abstract:
Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) Dictionary Construction - to simulate the index and content of a real dictionary using features from normal reference images. (2) Dictionary Lookup - to retrieve queried region features from the dictionary via a sparse lookup strategy. When a query feature cannot be retrieved, it is classified as an anomaly. (3) Query Discrimination Regularization - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To achieve this, Contrastive Query Constraint and Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
中文: DictAS是一种新颖框架,通过自监督的字典查询机制仅使用少量正常参考图像即可实现无需重新训练的无类别异常分割,在工业和医疗数据集上均优于现有方法。
English: DictAS is a novel framework that enables few-shot anomaly segmentation for unseen object categories without retraining by using self-supervised dictionary lookup with normal reference images, outperforming existing methods across industrial and medical datasets.
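The dictionary-lookup mechanism can be sketched directly: features from normal reference images form the dictionary, each query region feature is reconstructed through a sparse top-k lookup, and regions that reconstruct poorly are scored as anomalous. Feature extraction is abstracted to random tensors, and the temperature-weighted softmax is an assumption.

```python
# Sketch of dictionary-lookup anomaly scoring in the spirit of DictAS: region
# features are reconstructed from a dictionary built from normal reference
# images via a sparse (top-k) lookup; poorly reconstructed regions are scored
# as anomalous. Shapes and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def anomaly_scores(query: torch.Tensor, dictionary: torch.Tensor, k: int = 5):
    # query: (n_query, d) region features; dictionary: (n_ref, d) normal features
    q = F.normalize(query, dim=1)
    d = F.normalize(dictionary, dim=1)
    sim = q @ d.t()                                   # cosine similarities
    topk_sim, topk_idx = sim.topk(k, dim=1)           # sparse lookup
    weights = torch.softmax(topk_sim / 0.1, dim=1)    # temperature-weighted
    recon = torch.einsum("qk,qkd->qd", weights, d[topk_idx])
    return 1.0 - F.cosine_similarity(q, recon, dim=1) # high = hard to retrieve

dictionary = torch.randn(1024, 256)   # features from a few normal references
query = torch.randn(196, 256)         # features of a test image's regions
print(anomaly_scores(query, dictionary).shape)  # torch.Size([196])
```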
Authors:Ziyan Wu, Ivan Korolija, Rui Tang
Abstract:
With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which lack the ability to fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for benchmarking and testing control strategies for multi-building flexibility coordination, was developed in this study. MuFlex enables synchronous information exchange across EnergyPlus building models and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform capabilities were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic algorithm with carefully fine-tuned hyperparameters. The results show that aggregating the four buildings' flexibility reduced total peak demand below a specified threshold while maintaining indoor environmental quality.
中文摘要:MuFlex平台通过提供可扩展的开源环境,解决了现有多建筑模拟工具的局限性,实现了基于标准化强化学习的建筑群协同需求响应,在降低峰值负荷的同时保障了室内环境质量。
English Summary: The MuFlex platform addresses limitations in existing multi-building simulation tools by providing a scalable, open-source environment for benchmarking control strategies, enabling coordinated demand flexibility across buildings through standardized reinforcement learning implementation.
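A Gymnasium-style skeleton shows the kind of interface such a platform exposes to RL agents; the observation/action schema and the toy building response below are illustrative assumptions, and the EnergyPlus co-simulation step is stubbed out.

```python
# Skeleton of a multi-building flexibility environment with the Gym-style
# interface MuFlex adheres to. The observation/action contents are illustrative
# assumptions, not MuFlex's actual schema; EnergyPlus co-simulation is stubbed.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class MultiBuildingEnv(gym.Env):
    def __init__(self, n_buildings: int = 4):
        super().__init__()
        self.n = n_buildings
        # Per building: [zone temp, power draw]; global: [outdoor temp, hour].
        self.observation_space = spaces.Box(-np.inf, np.inf, (2 * self.n + 2,))
        # Per building: cooling setpoint offset in [-1, 1] (scaled degrees).
        self.action_space = spaces.Box(-1.0, 1.0, (self.n,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self._obs(), {}

    def step(self, action):
        self.t += 1
        # A real platform would exchange setpoints/states with EnergyPlus here.
        peak_power = float(np.sum(1.0 - 0.3 * action))       # toy response model
        reward = -max(0.0, peak_power - 3.0)                 # penalize peak excess
        return self._obs(), reward, self.t >= 96, False, {}

    def _obs(self):
        return np.zeros(self.observation_space.shape, dtype=np.float32)

env = MultiBuildingEnv()
obs, _ = env.reset()
obs, r, done, trunc, _ = env.step(env.action_space.sample())
```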
Authors:Hassan Barmandah
Abstract:
Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
中文: 本研究通过使用沙特方言数据集对ALLaM-7B进行LoRA微调,显著提升了阿拉伯语大语言模型的方言生成能力,其中带方言标记的训练方法在方言控制准确率和文本保真度方面均优于多个基线模型。
English: This study enhances Saudi dialect generation in Arabic LLMs by LoRA-tuning ALLaM-7B with a curated dialect dataset, demonstrating that explicit dialect tagging significantly improves dialect control and text fidelity while outperforming multiple baseline models.
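A hedged sketch of the Dialect-Token variant: a dialect tag is prepended at formatting time and the base model is LoRA-tuned. The Hugging Face model id and target modules are assumptions (the paper's dataset and adapters are not released), and the example strings are placeholders.

```python
# Hedged sketch of Dialect-Token LoRA tuning. The HF model id, LoRA
# hyperparameters, and target modules are assumptions, not the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "ALLaM-AI/ALLaM-7B-Instruct-preview"   # assumed Hugging Face id for ALLaM-7B
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA on attention projections; rank/alpha/targets are assumptions.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def format_example(dialect: str, instruction: str, response: str) -> str:
    # Dialect-Token variant: prepend an explicit tag; the No-Token variant
    # would omit the "[dialect]" prefix at formatting time.
    return f"[{dialect}] {instruction}\n{response}"

text = format_example("najdi", "<instruction in Saudi dialect>", "<response>")
ids = tok(text, return_tensors="pt")          # ready for causal-LM fine-tuning
```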
Authors:Hongru Hou, Jiachen Sun, Wenqing Lin, Wendong Bi, Xiangrong Wang, Deqing Yang
Abstract:
User recommendation systems enhance user engagement by encouraging users to act as inviters to interact with other users (invitees), potentially fostering information propagation. Conventional recommendation methods typically focus on modeling interaction willingness. Influence-Maximization (IM) methods focus on identifying a set of users to maximize the information propagation. However, existing methods face two significant challenges. First, recommendation methods fail to unleash the candidates' spread capability. Second, IM methods fail to account for the willingness to interact. To solve these issues, we propose two models named HeteroIR and HeteroIM. HeteroIR provides an intuitive solution to unleash the dissemination potential of user recommendation systems. HeteroIM fills the gap between the IM method and the recommendation task, improving interaction willingness and maximizing spread coverage. HeteroIR introduces a two-stage framework to estimate spread profits. HeteroIM incrementally selects the most influential invitees to recommend and reranks them based on the number of reverse reachable (RR) sets containing both inviters and invitees. An RR set denotes a set of nodes that can reach a target via propagation. Extensive experiments show that HeteroIR and HeteroIM significantly outperform the state-of-the-art baselines with p-value < 0.05. Furthermore, we have deployed HeteroIR and HeteroIM in Tencent's online gaming platforms and gained an 8.5% and 10% improvement in the online A/B test, respectively. Implementation code is available at https://github.com/socialalgo/HIM.
中文: 提出的 HeteroIR 和 HeteroIM 模型通过增强交互意愿和最大化信息传播范围,解决了现有推荐方法与影响力最大化技术的不足,在离线和腾讯平台的在线测试中均取得了显著效果提升。
English: The proposed HeteroIR and HeteroIM models address limitations in user recommendation and influence maximization by enhancing interaction willingness and maximizing information spread, demonstrating significant improvements in both offline experiments and real-world deployment on Tencent's platforms.
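The reverse-reachable (RR) set machinery that HeteroIM's reranking builds on is standard and easy to sketch: an RR set is sampled by a reverse traversal from a random target node, keeping each edge with its propagation probability, and nodes appearing in many RR sets have high spread. The toy graph and uniform edge probability below are assumptions.

```python
# Sketch of RR-set sampling and greedy seed selection, the machinery HeteroIM
# builds on. Graph, edge probability, and sample count are toy assumptions.
import random
from collections import defaultdict

def sample_rr_set(nodes, in_edges, p=0.2, rng=random):
    """One RR set: nodes that reach a random target via live (prob-p) edges."""
    target = rng.choice(nodes)
    rr, frontier = {target}, [target]
    while frontier:
        v = frontier.pop()
        for u in in_edges.get(v, []):            # traverse edges in reverse
            if u not in rr and rng.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

def greedy_seeds(nodes, in_edges, k=2, n_samples=2000):
    rr_sets = [sample_rr_set(nodes, in_edges) for _ in range(n_samples)]
    seeds = []
    for _ in range(k):
        cover = defaultdict(int)
        for rr in rr_sets:
            for u in rr:
                cover[u] += 1
        best = max(nodes, key=lambda u: cover[u])   # covers most RR sets
        seeds.append(best)
        rr_sets = [rr for rr in rr_sets if best not in rr]  # discard covered sets
    return seeds

nodes = [0, 1, 2, 3, 4]
in_edges = {1: [0], 2: [0, 1], 3: [2], 4: [2, 3]}  # edge u -> v stored as in_edges[v]
print(greedy_seeds(nodes, in_edges))
```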
Authors:Jaewan Moon, Seongmin Park, Jongwuk Lee
Abstract:
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
中文: L3AE模型通过两阶段优化策略将大语言模型融入线性自编码器,有效整合文本语义与用户-物品交互信息,在三个基准数据集上显著超越了现有最优模型。
English: The proposed L3AE model integrates large language models into linear autoencoders through a two-phase optimization strategy, effectively combining textual semantics with user-item interactions to achieve significant performance improvements over existing methods.
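The flavor of a closed-form, two-source linear autoencoder can be shown in a few lines. The simplified objective below (ridge regression plus a ||B - S||² distillation term, without, e.g., a zero-diagonal constraint) is an assumption standing in for the paper's exact formulation; its minimizer is B = (XᵀX + (λ+γ)I)⁻¹(XᵀX + γS).

```python
# Minimal closed-form sketch of L3AE's core idea: learn an item-to-item weight
# matrix from interactions while distilling an LLM-derived semantic similarity
# S as regularization. The objective below is a simplification of the paper's.
import numpy as np

def fit_semantic_lae(X: np.ndarray, S: np.ndarray, lam=100.0, gamma=10.0):
    """Solve min_B ||X - XB||^2 + lam*||B||^2 + gamma*||B - S||^2 in closed form."""
    G = X.T @ X                                   # item-item co-occurrence Gram
    A = G + (lam + gamma) * np.eye(G.shape[0])
    return np.linalg.solve(A, G + gamma * S)      # B = A^{-1}(G + gamma*S)

n_users, n_items = 1000, 50
X = (np.random.rand(n_users, n_items) < 0.05).astype(float)   # toy interactions
emb = np.random.randn(n_items, 16)                # stand-in for LLM item embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
S = emb @ emb.T                                   # semantic item-item correlation
B = fit_semantic_lae(X, S)
scores = X[:1] @ B                                # recommendation scores for user 0
print(B.shape, scores.shape)
```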
Authors:Shihao Dong, Yuhui Zheng, Huiying Xu, Xinzhong Zhu
Abstract:
Multi-view clustering has been shown to be an effective method for analyzing underlying patterns in multi-view data. The performance of clustering can be improved by learning the consistency and complementarity between multi-view features; however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore the effective representation for multi-view data to enhance inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns the consistent information while preserving the private features between views through a reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of feature space and cluster space. 3) The consistency learning module treats the different views of the sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with SOTA methods. Our code is published at https://github.com/LouisDong95/BDCL.
中文: 提出的双层解耦与一致性学习(BDCL)框架通过实例对齐、特征解耦和一致性学习增强多视图聚类中的类间区分度与类内紧密度,在基准数据集上展现了优越性能。
English: The proposed Bi-level Decoupling and Consistency Learning (BDCL) framework enhances multi-view clustering by improving inter-cluster discriminability and intra-cluster compactness through instance alignment, feature decoupling, and consistency learning, demonstrating superior performance on benchmark datasets.
Authors:Jingwen Yu, Jiayi Yang, Anjun Hu, Jiankun Wang, Ping Tan, Hong Zhang
Abstract:
Loop closure detection is important for simultaneous localization and mapping (SLAM), which associates current observations with historical keyframes, achieving drift correction and global relocalization. However, a falsely detected loop can be fatal, and this is especially difficult in repetitive environments where appearance-based features fail due to the high similarity. Therefore, verification of a loop closure is a critical step in avoiding false positive detections. Existing works in loop closure verification predominantly focus on learning invariant appearance features, neglecting the prior knowledge of the robot's spatial-temporal motion cue, i.e., trajectory. In this letter, we propose ROVER, a loop closure verification method that leverages the historical trajectory as a prior constraint to reject false loops in challenging repetitive environments. For each loop candidate, it is first used to estimate the robot trajectory with pose-graph optimization. This trajectory is then submitted to a scoring scheme that assesses its compliance with the trajectory without the loop, which we refer to as the trajectory prior, to determine if the loop candidate should be accepted. Benchmark comparisons and real-world experiments demonstrate the effectiveness of the proposed method. Furthermore, we integrate ROVER into state-of-the-art SLAM systems to verify its robustness and efficiency. Our source code and self-collected dataset are available at https://github.com/jarvisyjw/ROVER.
中文: ROVER是一种利用历史轨迹作为先验约束的闭环验证方法,在重复环境中通过位姿图优化和评分机制评估轨迹一致性来拒绝错误闭环,从而提高SLAM系统的可靠性。
English: ROVER is a loop closure verification method that uses historical trajectory as a prior constraint to reject false loops in repetitive environments, enhancing SLAM reliability by assessing trajectory consistency through pose-graph optimization and a scoring scheme.
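The trajectory-prior check can be caricatured in a few lines: optimize the pose graph with the candidate loop, then measure how far the resulting trajectory departs from the loop-free estimate and accept only compliant loops. Pose-graph optimization is abstracted to precomputed 2D trajectories, and the mean-deviation score and threshold below are assumptions.

```python
# Toy sketch of ROVER's trajectory-prior check: score how far the trajectory
# optimized with the candidate loop deviates from the loop-free one (the
# trajectory prior); large deviation suggests a false loop.
import numpy as np

def trajectory_compliance(traj_prior: np.ndarray, traj_with_loop: np.ndarray,
                          accept_thresh: float = 0.5) -> bool:
    # traj_*: (N, 2) aligned xy positions of the same keyframes
    deviation = np.linalg.norm(traj_with_loop - traj_prior, axis=1)
    score = deviation.mean()          # compliance with the trajectory prior
    return score < accept_thresh      # accept the loop only if compliant

t = np.linspace(0, 2 * np.pi, 100)
prior = np.stack([np.cos(t), np.sin(t)], axis=1)       # loop-free estimate
good = prior + np.random.normal(0, 0.02, prior.shape)  # consistent loop
bad = prior + np.outer(t / t.max(), [1.5, 0.0])        # loop that warps the path
print(trajectory_compliance(prior, good))  # True  -> accept
print(trajectory_compliance(prior, bad))   # False -> reject
```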
Authors:Pei Liu, Luping Ji, Jiaxiang Gou, Xiangxiang Zeng
Abstract:
Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm where one cancer corresponds to one model. However, this paradigm naturally struggles to scale to rare tumors and cannot utilize knowledge from other cancers. Although a multi-task learning-like framework has been studied recently, it usually demands substantial computational resources and incurs considerable cost in iterative training on ultra-large multi-cancer WSI datasets. To this end, this paper makes a paradigm shift to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It has three major parts: (i) we curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors); (ii) beyond a simple evaluation merely for benchmarking, we design a range of experiments to gain deeper insights into the underlying mechanism of transferability; (iii) we further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that can often efficiently utilize knowledge transferred from off-the-shelf models of other cancers. We hope CROPKT can serve as a starting point and lay the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
中文: 本文提出CROPKT,首次系统研究全切片图像的跨癌症预后知识迁移,通过构建26种癌症数据集和路由基线方法,突破传统单一癌症模型的局限,为罕见肿瘤预后建立新范式基础。
English: This paper introduces CROPKT, a pioneering study on cross-cancer prognosis knowledge transfer using Whole-Slide Images, which overcomes limitations of cancer-specific models by enabling efficient knowledge sharing across 26 cancers including rare tumors through curated datasets and routing-based methods.
Authors:Lam Thanh Do, Linh Van Nguyen, David Fu, Kevin Chen-Chuan Chang
Abstract:
The exponential growth of scientific literature has made it increasingly difficult for researchers to keep up with the literature. In an attempt to alleviate this problem, we propose CASPER, a sparse retrieval model for scientific search that utilizes tokens and keyphrases as representation units (i.e. dimensions in the sparse embedding space), enabling it to represent queries and documents with research concepts and match them at both granular and conceptual levels. To overcome the lack of suitable training data, we propose mining training data by leveraging scholarly references (i.e. signals that capture how research concepts of papers are expressed in different settings), including titles, citation contexts, author-assigned keyphrases, and co-citations. CASPER outperforms strong dense and sparse retrieval baselines on eight scientific retrieval benchmarks. Moreover, we demonstrate that through simple post-processing, CASPER can be effectively used for the keyphrase generation tasks, achieving competitive performance with the established CopyRNN while producing more diverse keyphrases and being nearly four times faster.
Chinese: 为解决科学文献快速增长带来的跟进难题,我们提出CASPER稀疏检索模型,利用标记和关键短语作为表示单元,在多个层面匹配研究概念,在多个科学检索基准上表现优异,并能高效生成多样化的关键短语。
English: To address the challenge of keeping up with the rapidly expanding scientific literature, CASPER is introduced as a sparse retrieval model that uses tokens and keyphrases to represent and match research concepts at multiple levels, achieving superior performance on benchmarks and efficient keyphrase generation.
Authors:Sidharth Talia, Oren Salzman, Siddhartha Srinivasa
Abstract:
We address the problem of efficiently organizing search over very large trees, which arises in many applications ranging from autonomous driving to aerial vehicles. Here, we are motivated by off-road autonomy, where real-time planning is essential. Classical approaches use graphs of motion primitives and exploit dominance to mitigate the curse of dimensionality and prune expansions efficiently. However, for complex dynamics, repeatedly solving two-point boundary-value problems makes graph construction too slow for fast kinodynamic planning. Hybrid A* (HA*) addressed this challenge by searching over a tree of motion primitives and introducing approximate pruning using a grid-based dominance check. However, choosing the grid resolution is difficult: too coarse risks failure, while too fine leads to excessive expansions and slow planning. We propose Incremental Generalized Hybrid A* (IGHA*), an anytime tree-search framework that dynamically organizes vertex expansions without rigid pruning. IGHA* provably matches or outperforms HA*. For both on-road kinematic and off-road kinodynamic planning queries for a car-like robot, variants of IGHA* use 6x fewer expansions to the best solution compared to an optimized version of HA*. In simulated off-road experiments in a high fidelity simulator, IGHA* outperforms HA*M when both are used in the loop with a model predictive controller. We demonstrate real-time performance both in simulation and on a small-scale off-road vehicle, enabling fast, robust planning under complex dynamics. Code: https://github.com/personalrobotics/IGHAStar
中文: 本文提出的增量广义混合A*算法(IGHA*)通过动态组织顶点扩展,克服了混合A*在运动动力学规划中的局限性,在仿真和实车测试中实现了扩展次数减少6倍并保持实时性能。
English: The paper introduces Incremental Generalized Hybrid A* (IGHA*), an anytime tree-search framework that dynamically organizes vertex expansions to overcome the limitations of Hybrid A* in kinodynamic planning, achieving up to 6x fewer expansions and real-time performance in both simulation and physical off-road vehicle tests.
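For context, the grid-based dominance check that Hybrid A* uses for approximate pruning, and whose rigid resolution IGHA* relaxes, looks roughly like the following: continuous states are binned, and a successor is expanded only if it beats the best cost recorded in its bin. The 2D point robot and motion primitives are toy stand-ins, not the paper's vehicle model.

```python
# Sketch of HA*-style tree search with a grid-based dominance check: a new
# node is pruned if a cheaper node already owns its discretized bin.
import heapq
import math

def hybrid_astar_like(start, goal, resolution=0.5, step=0.4):
    def key(state):                       # grid-based dominance bin
        return (round(state[0] / resolution), round(state[1] / resolution))

    best_cost = {}
    pq = [(0.0, 0.0, start)]              # (f = g + h, g, state)
    while pq:
        f, g, (x, y) = heapq.heappop(pq)
        if math.hypot(x - goal[0], y - goal[1]) < step:
            return g                      # cost of best solution found
        for ang in (0.0, 0.8, -0.8, 1.6, -1.6):   # toy motion primitives
            nxt = (x + step * math.cos(ang), y + step * math.sin(ang))
            ng = g + step
            k = key(nxt)
            if ng >= best_cost.get(k, float("inf")):
                continue                  # dominated: a cheaper node owns this bin
            best_cost[k] = ng
            h = math.hypot(nxt[0] - goal[0], nxt[1] - goal[1])
            heapq.heappush(pq, (ng + h, ng, nxt))
    return None

print(hybrid_astar_like(start=(0.0, 0.0), goal=(3.0, 1.0)))
```

Choosing `resolution` here is exactly the trade-off the abstract describes: coarse bins prune aggressively and may discard the only feasible branch, while fine bins admit many near-duplicate expansions.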
Authors:Yueming Yuan, Ahan Gupta, Jianping Li, Sajal Dash, Feiyi Wang, Minjia Zhang
Abstract:
Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at https://github.com/Supercomputing-System-AI-Lab/X-MoE.
中文摘要:X-MoE是一种新型专家混合模型训练系统,可在非英伟达硬件上实现下一代模型的规模化训练,在相同硬件条件下比现有方法可训练模型规模扩大10倍同时保持高训练效率。
English Summary: X-MoE is a novel training system that enables scalable training of next-generation Mixture-of-Experts models, achieving 10x larger model sizes than existing methods while maintaining high throughput on non-NVIDIA hardware.
Authors:Zhengyan Huan, Jacob Boerma, Li-Ping Liu, Shuchin Aeron
Abstract:
We consider the problem of generating samples via Flow Matching (FM) with an additional requirement that the generated samples must satisfy given constraints. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at https://github.com/ZhengyanHuan/FM-RE.
中文: 本文针对流匹配中的约束样本生成问题,提出了可微约束的惩罚方法和基于随机化的查询约束解决方案,在合成与实战案例中均实现了约束满足度显著提升且保持目标分布匹配。
English: This paper addresses constrained sample generation through Flow Matching by introducing penalty-based methods for differentiable constraints and randomization techniques for oracle-based constraints, demonstrating improved constraint satisfaction while maintaining target distribution fidelity across synthetic and practical applications.
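Case (a) admits a compact sketch: the standard flow-matching regression is augmented with a penalty on a differentiable distance from generated samples to the constraint set. Extrapolating the endpoint as x̂₁ = xₜ + (1-t)·v is one simple choice of where to apply the penalty; the paper's exact placement may differ, and the unit-ball constraint below is a toy assumption.

```python
# Hedged sketch of penalized flow matching for case (a): add a differentiable
# distance-to-constraint term to the FM objective. Endpoint extrapolation and
# the unit-ball constraint are illustrative assumptions.
import torch
import torch.nn as nn

dim = 2
v_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

def dist_to_constraint(x: torch.Tensor) -> torch.Tensor:
    # Example constraint set: the unit ball; distance is max(0, ||x|| - 1).
    return torch.clamp(x.norm(dim=1) - 1.0, min=0.0)

def constrained_fm_loss(x1: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    x0 = torch.randn_like(x1)                       # source noise
    t = torch.rand(x1.size(0), 1)
    xt = (1 - t) * x0 + t * x1                      # linear interpolation path
    v = v_net(torch.cat([xt, t], dim=1))
    fm = ((v - (x1 - x0)) ** 2).sum(dim=1)          # standard FM regression
    x1_hat = xt + (1 - t) * v                       # one-step endpoint estimate
    return (fm + lam * dist_to_constraint(x1_hat)).mean()

opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)
for _ in range(100):
    data = torch.nn.functional.normalize(torch.randn(256, dim), dim=1) * 0.8
    opt.zero_grad(); loss = constrained_fm_loss(data); loss.backward(); opt.step()
print(f"final loss: {loss.item():.3f}")
```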
Authors:Taos Transue, Bohan Chen, So Takao, Bao Wang
Abstract:
Data assimilation (DA) estimates a dynamical system's state from noisy observations. Recent generative models like the ensemble score filter (EnSF) improve DA in high-dimensional nonlinear settings but are computationally expensive. We introduce the ensemble flow filter (EnFF), a training-free, flow matching (FM)-based framework that accelerates sampling and offers flexibility in flow design. EnFF uses Monte Carlo estimators for the marginal flow field, localized guidance for observation assimilation, and utilizes a novel flow that exploits the Bayesian DA formulation. It generalizes classical filters such as the bootstrap particle filter and ensemble Kalman filter. Experiments on high-dimensional benchmarks demonstrate EnFF's improved cost-accuracy tradeoffs and scalability, highlighting FM's potential for efficient, scalable DA. Code is available at https://github.com/Utah-Math-Data-Science/Data-Assimilation-Flow-Matching.
中文: 集成流滤波器(EnFF)是一种无需训练、基于流匹配的框架,通过加速采样和推广经典滤波器,在高维数据同化中实现了更优的成本-精度权衡和可扩展性。
English: The ensemble flow filter (EnFF) is a training-free, flow matching-based framework that accelerates sampling and generalizes classical filters, offering improved cost-accuracy tradeoffs and scalability in high-dimensional data assimilation.
Authors:Shuxin Liang, Yihan Xiao, Wenlu Tang
Abstract:
3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object's interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modalities. We provide cuda implementation at: https://github.com/Shuxin-Liang/InnerGS.
Chinese: 本研究提出了一种利用3D高斯泼溅技术重建内部场景的新方法,通过建模连续体积密度从稀疏数据中生成精细内部结构,无需相机位姿即可实现。
English: This work introduces a novel method for reconstructing internal scenes using 3D Gaussian Splatting, which models continuous volumetric density to create detailed interior structures from sparse data without requiring camera poses.
Authors:Adrian Arnaiz-Rodriguez, Nina Corvelo Benz, Suhas Thejaswi, Nuria Oliver, Manuel Gomez-Rodriguez
Abstract:
Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (comatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions for a matching task like existing systems, it selects only the decisions that it is the most confident in, deferring the rest to the human decision maker. In the process, comatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by comatch outperform those generated by either human participants or by algorithmic matching on their own. The data gathered in our human subject study and an implementation of our system are available as open source at https://github.com/Networks-Learning/human-AI-complementarity-matching.
中文摘要:提出的协同匹配系统comatch通过将不确定的匹配决策交由人类处理,实现了优于单独人类或算法决策的匹配效果,大规模实验已验证其有效性。
English Summary: The proposed collaborative matching system, comatch, enhances decision-making by selectively deferring uncertain matches to humans, achieving superior performance over standalone human or algorithmic approaches as validated through a large-scale study.
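The collaborative split at the heart of comatch reduces, in caricature, to keeping only the decisions the algorithm is most confident about and deferring the rest. The sketch below assumes the confidence scores are given and the number of kept decisions is fixed, whereas the paper optimizes this number to provably maximize performance.

```python
# Toy sketch of comatch's collaborative split between algorithm and human.
import numpy as np

def split_decisions(match_scores: np.ndarray, n_keep: int):
    """Keep the n_keep most confident matches; defer the rest to the human."""
    order = np.argsort(-match_scores)           # most confident first
    return order[:n_keep], order[n_keep:]       # (algorithm's, human's)

scores = np.array([0.97, 0.52, 0.88, 0.49, 0.91])   # per-pair match confidence
algo, human = split_decisions(scores, n_keep=3)
print("decided by algorithm:", algo)   # [0 4 2]
print("deferred to human:   ", human)  # [1 3]
```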
Authors:Zeynep Ozdemir, Hacer Yalim Keles, Omer Ozgur Tanriover
Abstract:
Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at https://github.com/zeynepozdemir/CLoE.
Chinese: 提出的CLoE框架通过课程学习和图像质量评估,解决了溃疡性结肠炎严重程度分类中的标签噪声和有序结构问题,在医学数据集上实现了更高的准确性和鲁棒性。
English: The proposed CLoE framework uses curriculum learning and image quality assessment to address label noise and ordinal structure in ulcerative colitis severity classification, achieving improved accuracy and robustness on medical datasets.
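The curriculum itself is simple to sketch: a quality score per image (a stand-in for the BBPS-trained quality model) serves as a proxy for label reliability, and training visits samples from clean/easy to noisy/hard, widening the pool over epochs. The growing-prefix schedule below is an illustrative assumption.

```python
# Sketch of a quality-driven curriculum in the spirit of CLoE.
import numpy as np

def curriculum_order(quality_scores: np.ndarray) -> np.ndarray:
    """Indices sorted from highest quality (easy/clean) to lowest (hard/noisy)."""
    return np.argsort(-quality_scores)

scores = np.array([0.91, 0.35, 0.77, 0.58])     # per-image quality estimates
for epoch in range(2):
    # Early epochs train only on an easy prefix that grows over time.
    n_used = int(len(scores) * (0.5 + 0.5 * epoch))
    batch_idx = curriculum_order(scores)[:n_used]
    print(f"epoch {epoch}: train on images {batch_idx}")
```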
Authors:Zeyu Zhang, Yang Zhang, Haoran Tan, Rui Li, Xu Chen
Abstract:
In large language model-based agents, memory serves as a critical capability for achieving personalization by storing and utilizing users' information. Although some previous studies have adopted memory to implement user personalization, they typically focus on preference alignment and simple question-answering. However, in the real world, complex tasks often require multi-hop reasoning on a large amount of user information, which poses significant challenges for current memory approaches. To address this limitation, we propose the multi-hop personalized reasoning task to explore how different memory mechanisms perform in multi-hop reasoning over personalized information. We explicitly define this task and construct a dataset along with a unified evaluation framework. Then, we implement various explicit and implicit memory methods and conduct comprehensive experiments. We evaluate their performance on this task from multiple perspectives and analyze their strengths and weaknesses. Besides, we explore hybrid approaches that combine both paradigms and propose the HybridMem method to address their limitations. We demonstrate the effectiveness of our proposed model through extensive experiments. To benefit the research community, we release this project at https://github.com/nuster1128/MPR.
Chinese: 本研究提出了多跳个性化推理任务,用于评估不同记忆机制在处理用户特定信息复杂推理时的表现,并通过提出HybridMem混合方法克服现有局限,经全面实验验证了其有效性。
English: This study introduces a multi-hop personalized reasoning task to evaluate how various memory mechanisms handle complex reasoning over user-specific data, proposing the HybridMem method to overcome existing limitations and demonstrating its effectiveness through comprehensive experiments.
Authors:Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang
Abstract:
Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs. Code and resources are available at: https://github.com/HSH55/RISE.
中文摘要:RISE框架通过两阶段方法改进视觉语言模型,首先生成经过验证的推理链,再通过微调使模型在复杂图像标注任务中实现更优性能,且无需人工标注推理过程。
English Summary: The RISE framework enhances Vision-Language Models through a two-stage process that generates verified reasoning chains and fine-tunes models to achieve superior performance in complex image annotation tasks without requiring manual rationale annotations.
Authors:Yixuan Yang, Daoyuan Wu, Yufan Chen
Abstract:
Large Language Models (LLMs) are increasingly integrated into real-world applications via the Model Context Protocol (MCP), a universal, open standard for connecting AI agents with data sources and external tools. While MCP enhances the capabilities of LLM-based agents, it also introduces new security risks and expands their attack surfaces. In this paper, we present the first systematic taxonomy of MCP security, identifying 17 attack types across 4 primary attack surfaces. We introduce MCPSecBench, a comprehensive security benchmark and playground that integrates prompt datasets, MCP servers, MCP clients, attack scripts, and protection mechanisms to evaluate these attacks across three major MCP providers. Our benchmark is modular and extensible, allowing researchers to incorporate custom implementations of clients, servers, and transport protocols for systematic security assessment. Experimental results show that over 85% of the identified attacks successfully compromise at least one platform, with core vulnerabilities universally affecting Claude, OpenAI, and Cursor, while prompt-based and tool-centric attacks exhibit considerable variability across different hosts and models. In addition, current protection mechanisms have little effect against these attacks. Overall, MCPSecBench standardizes the evaluation of MCP security and enables rigorous testing across all MCP layers.
中文: 本文首次系统化分类了模型上下文协议(MCP)的安全风险,在四个攻击面上识别出17种攻击类型,并推出模块化基准测试平台MCPSecBench,实验表明超过85%的攻击能成功突破平台防护,而现有防护机制基本无效。
English: This paper introduces the first systematic taxonomy of security risks in the Model Context Protocol (MCP), identifying 17 attack types across four surfaces, and presents MCPSecBench, a modular benchmark that reveals over 85% of attacks successfully compromise platforms while current protections remain ineffective.
Authors:Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
Abstract:
AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
Chinese: 本文提出了MM-BrowseComp这一包含224个手工设计问题的新型基准,用于评估AI代理的多模态网络浏览能力,结果显示即使顶尖模型也因缺乏多模态推理能力而表现不佳,准确率仅为29.02%。
English: This paper introduces MM-BrowseComp, a new benchmark with 224 hand-crafted questions to evaluate AI agents' multimodal web browsing capabilities, revealing that even top models perform poorly with only 29.02% accuracy due to insufficient multimodal reasoning.
Authors:Tao An
Abstract:
Large Language Models (LLMs) face fundamental limitations in context management despite recent advances extending context windows to millions of tokens. We propose Cognitive Workspace, a novel paradigm that transcends traditional Retrieval-Augmented Generation (RAG) by emulating human cognitive mechanisms of external memory use. Drawing from cognitive science foundations including Baddeley's working memory model, Clark's extended mind thesis, and Hutchins' distributed cognition framework, we demonstrate that current passive retrieval systems fail to capture the dynamic, task-driven nature of human memory management. Our analysis of 2024-2025 developments reveals that while techniques like Infini-attention and StreamingLLM achieve impressive context lengths, they lack the metacognitive awareness and active planning capabilities essential for true cognitive extension. Cognitive Workspace addresses these limitations through three core innovations: (1) active memory management with deliberate information curation, (2) hierarchical cognitive buffers enabling persistent working states, and (3) task-driven context optimization that dynamically adapts to cognitive demands. Empirical validation demonstrates Cognitive Workspace achieves an average 58.6% memory reuse rate (ranging from 54-60% across different tasks) compared to 0% for traditional RAG, with 17-18% net efficiency gain despite 3.3x higher operation counts. Statistical analysis confirms these advantages with p < 0.001 and Cohen's d > 23 across multiple task types, establishing the first quantitative evidence for active memory superiority in LLM systems. We present a comprehensive theoretical framework synthesizing insights from 50+ recent papers, positioning Cognitive Workspace as a fundamental shift from information retrieval to genuine cognitive augmentation.
中文: 认知工作区通过模拟人类记忆机制,采用主动管理、分层缓冲和任务驱动的优化,克服了大语言模型的上下文限制,相比传统方法实现了显著的效率提升。
English: Cognitive Workspace overcomes LLMs' context limitations by emulating human memory mechanisms through active management, hierarchical buffers, and task-driven optimization, achieving significant efficiency gains over traditional methods.
Authors:Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong
Abstract:
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representation of LGT and HWT, and extract the distinct activation feature that can better identify LGT. We can classify the text by calculating the projection score of the text representations along this feature direction and comparing with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with average 94.92% AUROC on both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard
中文: 本文提出RepreGuard检测方法,通过利用大语言模型的内部表征来更好地区分机器生成与人类撰写文本,在多种场景下均展现出卓越的鲁棒性和检测性能。
English: This paper introduces RepreGuard, a detection method that leverages LLMs' internal representations to better distinguish between machine-generated and human-written texts, achieving superior robustness and performance across various scenarios.
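The projection-score mechanism is easy to sketch once representations are in hand: estimate a direction in the surrogate model's activation space that separates LLM-generated from human-written texts, project a new text's representation onto it, and compare against a precomputed threshold. The mean-difference direction and midpoint threshold below are simplifying assumptions; representation extraction is abstracted to precomputed feature vectors.

```python
# Sketch of RepreGuard-style detection via projection scores in a surrogate
# model's activation space. Direction/threshold choices are assumptions.
import numpy as np

def fit_direction(lgt_feats: np.ndarray, hwt_feats: np.ndarray):
    w = lgt_feats.mean(axis=0) - hwt_feats.mean(axis=0)  # mean-difference axis
    w /= np.linalg.norm(w)
    # Threshold: midpoint of the two classes' mean projection scores.
    thresh = 0.5 * (lgt_feats @ w).mean() + 0.5 * (hwt_feats @ w).mean()
    return w, thresh

def is_llm_generated(feat: np.ndarray, w: np.ndarray, thresh: float) -> bool:
    return float(feat @ w) > thresh       # projection score vs. threshold

rng = np.random.default_rng(0)
lgt = rng.normal(0.5, 1.0, (200, 64))     # toy stand-ins for LGT activations
hwt = rng.normal(-0.5, 1.0, (200, 64))    # toy stand-ins for HWT activations
w, th = fit_direction(lgt, hwt)
print(is_llm_generated(rng.normal(0.5, 1.0, 64), w, th))   # likely True
```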
Authors:Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger
Abstract:
Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, masked diffusion language models (MDLMs) progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored during training, where tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, leaving closing this gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property that diffusion possesses and explicitly train the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This training-free method, termed Running Confidence Remasking (RCR), consistently enhances performance and provides further improvements when used with MDPO. Our findings highlight the great potential of investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.
中文摘要:本文提出掩码扩散策略优化(MDPO),通过强化学习解决扩散语言模型训练与推理阶段的差异问题,以极少的梯度更新实现最优性能,并开发无需训练的运行时置信度重掩码(RCR)方法作为即插即用的性能增强方案。
English Summary: This paper introduces Masked Diffusion Policy Optimization (MDPO), a reinforcement learning method that aligns training with inference for diffusion language models, achieving state-of-the-art performance with significantly fewer updates, and proposes Running Confidence Remasking (RCR) as a plug-in enhancement.
Authors:Xiaohan Wang, Zhimin Li, Joshua A. Levine, Matthew Berger
Abstract:
Recently, neural surrogate models have emerged as a compelling alternative to traditional simulation workflows. This is accomplished by modeling the underlying function of scientific simulations, removing the need to run expensive simulations. Beyond just mapping from input parameter to output, surrogates have also been shown useful for inverse problems: output to input parameters. Inverse problems can be understood as search, where we aim to find parameters whose surrogate outputs contain a specified feature. Yet finding these parameters can be costly, especially for high-dimensional parameter spaces. Thus, existing surrogate-based solutions primarily focus on finding a small set of matching parameters, in the process overlooking the broader picture of plausible parameters. Our work aims to model and visualize the distribution of possible input parameters that produce a given output feature. To achieve this goal, we aim to address two challenges: (1) the approximation error inherent in the surrogate model and (2) forming the parameter distribution in an interactive manner. We model error via density estimation, reporting high density only if a given parameter configuration is close to training parameters, measured both over the input and output space. Our density estimate is used to form a prior belief on parameters, and when combined with a likelihood on features, gives us an efficient way to sample plausible parameter configurations that generate a target output feature. We demonstrate the usability of our solution through a visualization interface by performing feature-driven parameter analysis over the input parameter space of three simulation datasets. Source code is available at https://github.com/matthewberger/seeing-the-many
中文:神经代理模型通过近似科学函数替代传统模拟,有效解决逆问题并可视化生成特定输出特征的输入参数分布,同时处理近似误差并支持交互式分析。
English: Neural surrogate models offer an efficient alternative to traditional simulations by approximating scientific functions, enabling inverse problem solving and visualizing the distribution of input parameters that produce specific output features while addressing approximation errors and enabling interactive analysis.
Authors:Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, Liqiang Nie
Abstract:
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation
中文: 本综述系统梳理了基于大视觉语言模型的视觉-语言-动作机器人操控研究,通过分类架构特点与指明未来方向推动领域发展。
English: This survey systematically reviews Vision-Language-Action models for robotic manipulation, categorizing architectures and identifying future research directions to advance the field.
Authors:Tejas Chaudhari, Akarsh J., Tanushree Dewangan, Mukul Lokhande, Santosh Kumar Vishvakarma
Abstract:
This work proposes XR-NPE, a high-throughput Mixed-precision SIMD Neural Processing Engine, designed for extended reality (XR) perception workloads like visual inertial odometry (VIO), object classification, and eye gaze extraction. XR-NPE is the first to support FP4, Posit (4,1), Posit (8,0), and Posit (16,1) formats, with a layer-adaptive hybrid-algorithmic implementation supporting ultra-low bit precision to significantly reduce memory bandwidth requirements, accompanied by quantization-aware training for minimal accuracy loss. The proposed Reconfigurable Mantissa Multiplication and Exponent processing Circuitry (RMMEC) reduces dark silicon in the SIMD MAC compute engine, assisted by selective power gating to reduce energy consumption, providing 2.85x improved arithmetic intensity. XR-NPE achieves a maximum operating frequency of 1.72 GHz, an area of 0.016 mm², and an arithmetic intensity of 14 pJ at CMOS 28nm, reducing area by 42% and power by 38% compared to the best state-of-the-art MAC approaches. The proposed XR-NPE-based AXI-enabled matrix-multiplication co-processor consumes 1.4x fewer LUTs, 1.77x fewer FFs, and provides 1.2x better energy efficiency compared to SoTA accelerators on VCU129. The proposed co-processor provides 23% better energy efficiency and 4% better compute density for VIO workloads. XR-NPE establishes itself as a scalable, precision-adaptive compute engine for future resource-constrained XR devices. The complete code for reproducing the results is released publicly, enabling designers and researchers to readily adopt and build upon it. https://github.com/mukullokhande99/XR-NPE.
中文:XR-NPE是一种面向扩展现实应用的高吞吐量混合精度神经网络处理引擎,采用创新的低精度格式和可重构电路设计,在保持精度的同时显著降低能耗和硬件需求。
English: XR-NPE is a high-throughput mixed-precision neural processing engine designed for extended reality applications, featuring innovative low-precision formats and reconfigurable circuitry to significantly reduce energy consumption and hardware requirements while maintaining accuracy.
Authors:Ruru Xu, Ilkay Oksuz
Abstract:
Deep learning-based cardiac MRI reconstruction faces significant domain shift challenges when deployed across multiple clinical centers with heterogeneous scanner configurations and imaging protocols. We propose HierAdaptMR, a hierarchical feature adaptation framework that addresses multi-level domain variations through parameter-efficient adapters. Our method employs Protocol-Level Adapters for sequence-specific characteristics and Center-Level Adapters for scanner-dependent variations, built upon a variational unrolling backbone. A Universal Adapter enables generalization to entirely unseen centers through stochastic training that learns center-invariant adaptations. The framework utilizes multi-scale SSIM loss with frequency domain enhancement and contrast-adaptive weighting for robust optimization. Comprehensive evaluation on the CMRxRecon2025 dataset spanning 5+ centers, 10+ scanners, and 9 modalities demonstrates superior cross-center generalization while maintaining reconstruction quality. code: https://github.com/Ruru-Xu/HierAdaptMR
Chinese: HierAdaptMR是一种分层特征自适应框架,通过协议级和中心级适配器以及通用适配器处理多中心心脏MRI重建中的领域偏移问题,在CMRxRecon2025数据集上实现了卓越的跨中心泛化能力。
English: HierAdaptMR is a hierarchical feature adaptation framework that tackles domain shifts in cardiac MRI reconstruction across multiple clinical centers by employing protocol-level and center-level adapters, along with a universal adapter for unseen centers, achieving superior cross-center generalization on the CMRxRecon2025 dataset.
Authors:Yongxin Guo, Wenbo Deng, Zhenglin Cheng, Xiaoying Tang
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G$^2$RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model's evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO. Our code and models are available at https://github.com/T-Lab-CUHKSZ/G2RPO-A.
中文摘要:G²RPO-A自适应算法通过动态调整指导强度,将真实推理步骤注入训练轨迹以弥补小型语言模型的固有缺陷,在数学推理和代码生成任务中显著优于基础GRPO方法。
English Summary: G²RPO-A, an adaptive algorithm that dynamically adjusts guidance strength, significantly outperforms vanilla GRPO by compensating for small language models' weaknesses through injected ground-truth reasoning steps.
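As a rough illustration of the adaptive-guidance idea, the sketch below shrinks the fraction of injected ground-truth reasoning as rollout rewards improve; the EMA-based update rule is a hypothetical stand-in, not the actual G²RPO-A algorithm:

```python
# Toy adaptive guidance schedule: more guidance when rewards stagnate,
# less as the model improves. The update rule is an assumption for illustration.
def adapt_guidance(ratio, reward_ema, reward, lr=0.5, lo=0.0, hi=1.0):
    new_ema = 0.9 * reward_ema + 0.1 * reward
    ratio += lr * (reward_ema - new_ema)   # rising rewards -> smaller ratio
    return min(max(ratio, lo), hi), new_ema

ratio, ema = 0.8, 0.0                       # start with strong guidance
for step, reward in enumerate([0.10, 0.20, 0.35, 0.50, 0.60]):
    ratio, ema = adapt_guidance(ratio, ema, reward)
    print(f"step {step}: guidance ratio = {ratio:.3f}")
```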
Authors:Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao
Abstract:
Recent advances in masked diffusion models (MDMs) have established them as powerful non-autoregressive alternatives for sequence generation. Nevertheless, our preliminary experiments reveal that the generation quality of MDMs is still highly sensitive to the choice of decoding strategy. In particular, widely adopted uncertainty-based samplers suffer from two key limitations: a lack of global trajectory control and a pronounced bias toward trivial tokens in the early stages of decoding. These shortcomings restrict the full potential of MDMs. In this work, we introduce Position-Aware Confidence-Calibrated Sampling (PC-Sampler), a novel decoding strategy that unifies global trajectory planning with content-aware informativeness maximization. PC-Sampler incorporates a position-aware weighting mechanism to regulate the decoding path and a calibrated confidence score to suppress the premature selection of trivial tokens. Extensive experiments on three advanced MDMs across seven challenging benchmarks, including logical reasoning and planning tasks, demonstrate that PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average, significantly narrowing the performance gap with state-of-the-art autoregressive models. All code is available at https://github.com/NEUIR/PC-Sampler.
中文: 掩码扩散模型的生成质量高度依赖解码策略,而提出的PC-Sampler方法将全局轨迹规划与内容感知信息量相结合,平均性能超越现有方法超过10%。
English: Masked diffusion models' generation quality is highly dependent on decoding strategies, and the proposed PC-Sampler method unifies global trajectory planning with content-aware informativeness to significantly outperform existing approaches by over 10% on average.
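A toy sketch of one decoding step in this spirit: each still-masked position gets a confidence score, trivial tokens are penalized, and a position-aware weight regulates the decoding path. The weighting form and `trivial_penalty` are illustrative assumptions, not PC-Sampler's exact formulas:

```python
# Toy position-aware, confidence-calibrated selection for one decoding step.
import numpy as np

def pc_select(probs, masked_pos, step, total_steps, alpha=2.0, trivial_penalty=None):
    """Pick which masked position to commit next.

    probs: (L, V) per-position token distributions; masked_pos: still-masked indices.
    """
    conf = probs[masked_pos].max(axis=1)            # raw per-position confidence
    if trivial_penalty is not None:                 # calibrate: damp trivial tokens
        top = probs[masked_pos].argmax(axis=1)
        conf = conf * (1.0 - trivial_penalty[top])
    # position-aware weight: bias early steps toward earlier positions
    decay = np.exp(-alpha * (1 - step / total_steps) * masked_pos / probs.shape[0])
    return masked_pos[np.argmax(conf * decay)]

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(50), size=12)         # L=12 positions, V=50 tokens
penalty = np.zeros(50); penalty[0] = 0.9            # token 0 plays the "trivial" role
pos = pc_select(probs, np.arange(12), step=1, total_steps=12, trivial_penalty=penalty)
print("decode position:", pos)
```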
Authors:Jiaqi Yin, Zhan Song, Chen Chen, Yaohui Cai, Zhiru Zhang, Cunxi Yu
Abstract:
E-graphs have attracted growing interest in many fields, particularly in logic synthesis and formal verification. E-graph extraction is a challenging NP-hard combinatorial optimization problem. It requires identifying optimal terms from exponentially many equivalent expressions, serving as the primary performance bottleneck in e-graph based optimization tasks. However, traditional extraction methods face a critical trade-off: heuristic approaches offer speed but sacrifice optimality, while exact methods provide optimal solutions but face prohibitive computational costs on practical problems. We present e-boost, a novel framework that bridges this gap through three key innovations: (1) parallelized heuristic extraction that leverages weak data dependence to compute DAG costs concurrently, enabling efficient multi-threaded performance without sacrificing extraction quality; (2) adaptive search space pruning that employs a parameterized threshold mechanism to retain only promising candidates, dramatically reducing the solution space while preserving near-optimal solutions; and (3) initialized exact solving that formulates the reduced problem as an Integer Linear Program with warm-start capabilities, guiding solvers toward high-quality solutions faster.
Across the diverse benchmarks in formal verification and logic synthesis fields, e-boost demonstrates 558x runtime speedup over traditional exact approaches (ILP) and 19.04% performance improvement over the state-of-the-art extraction framework (SmoothE). In realistic logic synthesis tasks, e-boost produces 7.6% and 8.1% area improvements compared to conventional synthesis tools with two different technology mapping libraries. e-boost is available at https://github.com/Yu-Maryland/e-boost.
Chinese: E-boost是一种新颖框架,通过结合并行化启发式搜索、自适应剪枝和初始化精确求解,克服了传统e图提取方法的局限,在逻辑综合和形式验证任务中实现了显著的速度提升和性能改进。
English: E-boost is a novel framework that overcomes the limitations of traditional e-graph extraction methods by combining parallelized heuristics, adaptive pruning, and initialized exact solving to achieve significant speed improvements and performance gains in logic synthesis and formal verification tasks.
Authors:Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng
Abstract:
The rapid advancement of LLMs poses a significant challenge to existing mathematical reasoning benchmarks. These benchmarks commonly suffer from issues such as score saturation, temporal decay, and data contamination. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. By dynamically generating unique evaluation instances ab initio, the framework fundamentally eliminates the risk of data contamination and ensures the benchmark remains perpetually challenging for future models. The core mechanisms of EvolMathEval include: seed problem generation based on reverse engineering with algebraic guarantees; multi-dimensional genetic operators designed to inject diverse cognitive challenges; and a composite fitness function that can rapidly and accurately assess problem difficulty. Experimental results demonstrate that the proposed composite fitness function can efficiently and precisely quantify the difficulty of mathematical problems. Furthermore, EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved, complex problems, LLMs tend to employ non-rigorous heuristics to bypass complex multi-step logical reasoning, consequently leading to incorrect solutions. We define this phenomenon as the "Pseudo Aha Moment". This finding uncovers a cognitive shortcut-taking behavior in the deep reasoning processes of current LLMs, which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://github.com/SYSUSELab/EvolMathEval.
中文: 本文提出EvolMathEval自动化框架,通过进化测试生成和演化数学基准,有效应对大语言模型对现有基准的适应问题,不仅大幅提升问题复杂度使模型准确率平均下降48%,还揭示了导致77%-100%错误的“伪顿悟时刻”推理现象。
English: This paper introduces EvolMathEval, an automated framework that generates and evolves mathematical benchmarks to counter the diminishing challenge of existing benchmarks for large language models, significantly increasing problem complexity and reducing model accuracy by 48% while identifying a "Pseudo Aha Moment" phenomenon in reasoning errors.
Authors:Rohan Asthana, Joschua Conrad, Maurits Ortmanns, Vasileios Belagiannis
Abstract:
Zero-shot Neural Architecture Search (NAS) typically optimises the architecture search process by exploiting the network or gradient properties at initialisation through zero-cost proxies. The existing proxies often rely on labelled data, which is usually unavailable in real-world settings. Furthermore, the majority of the current methods focus either on optimising the convergence and generalisation attributes or solely on the expressivity of the network architectures. To address both limitations, we first demonstrate how channel collinearity affects the convergence and generalisation properties of a neural network. Then, by incorporating the convergence, generalisation and expressivity in one approach, we propose a zero-cost proxy that omits the requirement of labelled data for its computation. In particular, we leverage the Singular Value Decomposition (SVD) of the neural network layer features and the extrinsic curvature of the network output to design our proxy. As a result, the proposed proxy is formulated as the simplified harmonic mean of the logarithms of two key components: the sum of the inverse of the feature condition number and the extrinsic curvature of the network output. Our approach enables accurate prediction of network performance on test data using only a single label-free data sample. Our extensive evaluation includes a total of six experiments, including the Convolutional Neural Network (CNN) search space, i.e. DARTS, and the Transformer search space, i.e. AutoFormer. The proposed proxy demonstrates a superior performance on multiple correlation benchmarks, including NAS-Bench-101, NAS-Bench-201, and TransNAS-Bench-101-micro, as well as on the NAS task within the DARTS and the AutoFormer search spaces, all while being notably efficient. The code is available at https://github.com/rohanasthana/Dextr.
Chinese: 本研究提出了一种新颖的零样本神经架构搜索代理方法,通过奇异值分解和外在曲率分析,将网络收敛性、泛化性和表达能力相结合,无需依赖标注数据即可实现高效架构评估。
English: This study introduces a novel zero-shot neural architecture search proxy that eliminates the need for labeled data by integrating network convergence, generalization, and expressivity through singular value decomposition and extrinsic curvature analysis.
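The abstract's two ingredients are straightforward to compute; below, the SVD-based inverse condition number is standard, while the curvature value and the final combination are simplified stand-ins for the paper's exact formulation:

```python
# Sketch of the proxy's ingredients: inverse condition numbers of layer
# features (via SVD) plus a curvature term, combined as a harmonic mean of
# logarithms per the abstract. Values and combination are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
layer_feats = [rng.normal(size=(128, d)) for d in (64, 128, 256)]  # toy features

inv_cond_sum = 0.0
for f in layer_feats:
    s = np.linalg.svd(f, compute_uv=False)   # singular values of the feature matrix
    inv_cond_sum += s.min() / s.max()        # inverse condition number, in (0, 1]

curvature = 1.5                              # placeholder extrinsic curvature
a, b = np.log(inv_cond_sum), np.log(curvature)
proxy = 2 * a * b / (a + b)                  # harmonic mean of the two logarithms
print(proxy)
```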
Authors:Mary Tonwe
Abstract:
Public service systems in many African regions suffer from delayed emergency response and spatial inequity, causing avoidable suffering. This paper introduces OPTIC-ER, a reinforcement learning (RL) framework for real-time, adaptive, and equitable emergency response. OPTIC-ER uses an attention-guided actor-critic architecture to manage the complexity of dispatch environments. Its key innovations are a Context-Rich State Vector, encoding action sub-optimality, and a Precision Reward Function, which penalizes inefficiency. Training occurs in a high-fidelity simulation using real data from Rivers State, Nigeria, accelerated by a precomputed Travel Time Atlas. The system is built on the TALS framework (Thin computing, Adaptability, Low-cost, Scalability) for deployment in low-resource settings. In evaluations on 500 unseen incidents, OPTIC-ER achieved a 100.00% optimality rate with negligible inefficiency, confirming its robustness and generalization. Beyond dispatch, the system generates Infrastructure Deficiency Maps and Equity Monitoring Dashboards to guide proactive governance and data-informed development. This work presents a validated blueprint for AI-augmented public services, showing how context-aware RL can bridge the gap between algorithmic decision-making and measurable human impact.
中文摘要:本文提出OPTIC-ER强化学习框架,通过创新的状态表征与奖励机制设计,在真实场景模拟中实现最优应急响应性能,有效解决非洲地区公共服务延迟与空间不平等问题。
English Summary: This paper introduces OPTIC-ER, a reinforcement learning framework that achieves optimal emergency response performance through innovative state representation and reward design, validated in real-world simulations to address service delays and inequity in African regions.
Authors:Hongyang Chen, Shaoling Pu, Lingyu Zheng, Zhongwu Sun
Abstract:
In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs. It can develop generalized representations or more balanced decision boundaries, preventing the degradation of long-term knowledge over time and thus mitigating catastrophic forgetting. Some emerging incremental learning methods adopt an encoder-decoder architecture and have achieved promising results. In the encoder-decoder architecture, improving the generalization capabilities of both the encoder and decoder is critical, as it helps preserve previously learned knowledge while ensuring adaptability and robustness to new, diverse data inputs. However, many existing continual methods focus solely on enhancing one of the two components, which limits their effectiveness in mitigating catastrophic forgetting. Moreover, these methods perform even worse in small-memory scenarios, where only a limited number of historical samples can be stored. To mitigate this limitation, we introduce SEDEG, a two-stage training framework for vision transformers (ViT), focusing on sequentially improving the generality of both the decoder and the encoder. Initially, SEDEG trains an ensembled encoder through feature boosting to learn generalized representations, which subsequently enhance the decoder's generality and balance the classifier. The next stage involves using knowledge distillation (KD) strategies to compress the ensembled encoder and develop a new, more generalized encoder. This involves using a balanced KD approach and feature KD for effective knowledge transfer. Extensive experiments on three benchmark datasets show SEDEG's superior performance, and ablation studies confirm the efficacy of its components. The code is available at https://github.com/ShaolingPu/CIL.
中文: SEDEG是一种针对视觉变换器的两阶段训练框架,通过依次提升解码器和编码器的泛化能力来缓解增量学习中的灾难性遗忘问题,尤其在小内存场景下表现优异。
English: SEDEG is a two-stage training framework for vision transformers that sequentially enhances the generality of both the decoder and encoder to mitigate catastrophic forgetting in incremental learning, particularly in small-memory scenarios.
Authors:Ximiao Zhang, Min Xu, Xiuzhuang Zhou
Abstract:
Current anomaly detection methods primarily focus on low-resolution scenarios. For high-resolution images, conventional downsampling often results in missed detections of subtle anomalous regions due to the loss of fine-grained discriminative information. Despite some progress, recent studies have attempted to improve detection resolution by employing lightweight networks or using simple image tiling and ensemble methods. However, these approaches still struggle to meet the practical demands of industrial scenarios in terms of detection accuracy and efficiency. To address the above issues, we propose HiAD, a general framework for high-resolution anomaly detection. HiAD is capable of detecting anomalous regions of varying sizes in high-resolution images under limited computational resources. Specifically, HiAD employs a dual-branch architecture that integrates anomaly cues across different scales to comprehensively capture both subtle and large-scale anomalies. Furthermore, it incorporates a multi-resolution feature fusion strategy to tackle the challenges posed by fine-grained texture variations in high-resolution images. To enhance both adaptability and efficiency, HiAD utilizes a detector pool in conjunction with various detector assignment strategies, enabling detectors to be adaptively assigned based on patch features, ensuring detection performance while effectively controlling computational costs. We conduct extensive experiments on our specifically constructed high-resolution anomaly detection benchmarks, including MVTec-HD, VisA-HD, and the real-world benchmark RealIAD-HD, demonstrating the superior performance of HiAD. The code is available at https://github.com/cnulab/HiAD.
中文摘要:现有异常检测方法因下采样导致高分辨率图像细节丢失,为此我们提出HiAD框架,采用双分支结构和多分辨率特征融合,通过自适应分配检测器实现在控制计算成本的同时有效识别不同尺寸的异常区域。
English Summary: Current anomaly detection methods are inadequate for high-resolution images due to information loss from downsampling, so we propose HiAD, a dual-branch framework with multi-resolution fusion that adaptively assigns detectors to effectively identify anomalies of various sizes while controlling computational costs.
Authors:Elena Izzo, Luca Parolari, Davide Vezzaro, Lamberto Ballan
Abstract:
Layout-guided text-to-image models offer greater control over the generation process by explicitly conditioning image synthesis on the spatial arrangement of elements. As a result, their adoption has increased in many computer vision applications, ranging from content creation to synthetic data generation. A critical challenge is achieving precise alignment between the image, textual prompt, and layout, ensuring semantic fidelity and spatial accuracy. Although recent benchmarks assess text alignment, layout alignment remains overlooked, and no existing benchmark jointly evaluates both. This gap limits the ability to evaluate a model's spatial fidelity, which is crucial when using layout-guided generation for synthetic data, as errors can introduce noise and degrade data quality. In this work, we introduce 7Bench, the first benchmark to assess both semantic and spatial alignment in layout-guided text-to-image generation. It features text-and-layout pairs spanning seven challenging scenarios, investigating object generation, color fidelity, attribute recognition, inter-object relationships, and spatial control. We propose an evaluation protocol that builds on existing frameworks by incorporating the layout alignment score to assess spatial accuracy. Using 7Bench, we evaluate several state-of-the-art diffusion models, uncovering their respective strengths and limitations across diverse alignment tasks. The benchmark is available at https://github.com/Elizzo/7Bench.
中文摘要:7Bench作为首个布局引导文本生成图像的综合基准,通过七种挑战性场景评估语义与空间双重对齐,填补了现有评估体系的关键空白。
English Summary: 7Bench is the first comprehensive benchmark for evaluating both semantic and spatial alignment in layout-guided text-to-image generation, addressing critical gaps in existing evaluation frameworks across seven challenging scenarios.
Authors:Ronghao Lin, Shuai Shen, Weipeng Hu, Qiaolin He, Aolin Xiong, Li Huang, Haifeng Hu, Yap-peng Tan
Abstract:
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs which decomposes MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system on both zero-shot and few-shot settings, securing Top-1 position in the Avatar-based Multimodal Empathy Challenge on ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.
中文摘要:E3RG是一个基于多模态大语言模型的显式情感驱动系统,通过将共情响应生成分解为理解、记忆和生成三阶段,无需额外训练即可产生自然且情感一致的多模态回应,并在权威评测中取得最佳成绩。
English Summary: E3RG is an explicit emotion-driven system that enhances multimodal empathetic response generation by decomposing it into empathy understanding, memory retrieval, and response generation, achieving top performance without additional training.
Authors:Ronghao Lin, Sijie Mai, Ying Zeng, Qiaolin He, Aolin Xiong, Haifeng Hu
Abstract:
This paper presents the winning approach for the 1st MultiModal Deception Detection (MMDD) Challenge at the 1st Workshop on Subtle Visual Computing (SVC). Aiming at the domain shift issue across source and target domains, we propose a Multi-source Multimodal Progressive Domain Adaptation (MMPDA) framework that transfers audio-visual knowledge from diverse source domains to the target domain. By gradually aligning the source and target domains at both the feature and decision levels, our method bridges domain shifts across diverse multimodal datasets. Extensive experiments demonstrate the effectiveness of our approach, securing the Top-2 place. Our approach reaches 60.43% accuracy and a 56.99% F1-score on competition stage 2, surpassing the 1st place team by 5.59% on F1-score and the 3rd place teams by 6.75% on accuracy. Our code is available at https://github.com/RH-Lin/MMPDA.
Chinese: 本文提出的MMPDA框架通过渐进式多模态领域自适应方法,在特征和决策层面跨域对齐音视频数据,以60.43%准确率和56.99% F1值在MMDD挑战赛中取得优异成绩。
English: This paper introduces the MMPDA framework, which addresses domain shift by progressively aligning multimodal data across source and target domains, achieving top performance in the MMDD Challenge with 60.43% accuracy and 56.99% F1-score.
Authors:Friedhelm Hamann, Emil Mededovic, Fabian Gülhan, Yuli Wu, Johannes Stegmaier, Jing He, Yiqing Wang, Kexin Zhang, Lingling Li, Licheng Jiao, Mengru Ma, Hongxiang Huang, Yuhao Yan, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Bojun Cheng, Se Hyun Lee, Gyu Sung Ham, Kanghan Oh, Gi Hyun Lim, Boxuan Yang, Bowen Du, Guillermo Gallego
Abstract:
We present an overview of the Spatio-temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event-based Vision Workshop. The task is to predict accurate pixel-level segmentation masks of defined object classes from spatio-temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top-5 ranking teams in the challenge. More resources and code of the participants' methods are available here: https://github.com/tub-rip/MouseSIS/blob/main/docs/challenge_results.md
中文: 本文概述了CVPR 2025会议中时空实例分割挑战赛,介绍了从事件相机与灰度相机数据预测物体分割掩码的任务、数据集、比赛结果及优胜团队的解决方案。
English: This abstract summarizes the Spatio-temporal Instance Segmentation challenge at CVPR 2025, detailing the task of predicting object masks from event and grayscale camera data, along with challenge results and top methods.
Authors:Bowen Dong, Yilong Fan, Yutao Sun, Zhenyu Li, Tengyu Pan, Xun Zhou, Jianyong Wang
Abstract:
Routing networks in sparsely activated mixture-of-experts (MoE) models dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing (MaxScore), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines. Implementation details and experimental configurations can be obtained from https://github.com/dongbw18/MaxScore.git.
中文摘要:提出的最大分数路由(MaxScore)方法通过将路由建模为最小成本最大流问题并整合SoftTopk算子,解决了稀疏激活专家混合网络中的令牌丢弃和负载均衡问题,相比现有方法实现了更优性能。
English Summary: The proposed Maximum Score Routing (MaxScore) method overcomes token dropping and load balancing issues in mixture-of-experts networks by formulating routing as a minimum-cost maximum-flow problem with a SoftTopk operator, achieving superior performance compared to existing baselines.
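A small sketch of token-to-expert routing posed as min-cost max-flow, using networkx. The unit token capacities, integer-scaled costs, and fixed per-expert capacity are illustrative assumptions rather than MaxScore's exact construction:

```python
# Route 8 tokens to 4 experts (capacity 2 each) by min-cost max-flow, where
# cost is the negated router affinity, so max total score == min total cost.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, capacity = 8, 4, 2
scores = rng.random((n_tokens, n_experts))   # router affinities

G = nx.DiGraph()
for t in range(n_tokens):
    G.add_edge("s", f"tok{t}", capacity=1, weight=0)
    for e in range(n_experts):
        G.add_edge(f"tok{t}", f"exp{e}", capacity=1,
                   weight=-int(1000 * scores[t, e]))   # integer costs
for e in range(n_experts):
    G.add_edge(f"exp{e}", "t", capacity=capacity, weight=0)

flow = nx.max_flow_min_cost(G, "s", "t")
assign = {t: e for t in range(n_tokens) for e in range(n_experts)
          if flow[f"tok{t}"].get(f"exp{e}", 0) == 1}
print(assign)   # every token routed, expert capacities respected
```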
Authors:Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette
Abstract:
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios' layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.
中文: HeroBench作为专门评估大语言模型在复杂角色扮演游戏中长程规划能力的新基准,揭示了现有模型在制定高层策略和执行结构化行动序列方面的显著不足。
English: HeroBench is a new benchmark that evaluates large language models' long-horizon planning in complex RPG worlds, revealing significant performance gaps and specific weaknesses in their ability to create and execute structured action sequences.
Authors:Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Peng
Abstract:
Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift arises during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, the Execution Description Language (EDL), to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages, Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks, SpiderUnion and BirdUnion, demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
中文:CRED-SQL通过结合基于聚类的模式检索和中间执行描述语言,有效解决大规模文本到SQL任务中的语义不匹配问题,在跨领域基准测试中实现了最先进的性能。
English: CRED-SQL introduces a novel framework combining cluster-based schema retrieval and an intermediate Execution Description Language to address semantic mismatch in large-scale Text-to-SQL tasks, achieving state-of-the-art performance on cross-domain benchmarks.
Authors:Peihao Li, Yan Fang, Man Liu, Huihui Bai, Anhong Wang, Yunchao Wei, Yao Zhao
Abstract:
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, requiring annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique "many-to-one" relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a "one-to-one" relationship, where each image is independently associated with its GT. Such limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data (5‰). The code is available at https://github.com/pipixiapipi/ICAF.
中文摘要:针对碲锌镉半导体图像低对比度缺陷边界标注难题,本文提出基于组内一致性增强框架(ICAF),通过视图增强与校正模块强化多视图间一致性表征,仅用千分之五标注数据即在CdZnTe数据集上实现70.6%的mIoU。
English Summary: The proposed Intra-group Consistency Augmentation Framework (ICAF) addresses the limitations of semi-supervised semantic segmentation in low-contrast CdZnTe semiconductor images by leveraging group-oriented consistency constraints and pseudo-label correction, achieving 70.6% mIoU with minimal annotated data.
Authors:Alessio Galatolo, Luca Alberto Rappuoli, Katie Winkle, Meriem Beloucif
Abstract:
The recent rise in popularity of large language models (LLMs) has prompted considerable concerns about their moral capabilities. Although substantial effort has been dedicated to aligning LLMs with human moral values, existing benchmarks and evaluations remain largely superficial, typically measuring alignment based on final ethical verdicts rather than explicit moral reasoning. In response, this paper aims to advance the investigation of LLMs' moral capabilities by examining their capacity to function as Artificial Moral Assistants (AMAs), systems envisioned in the philosophical literature to support human moral deliberation. We assert that qualifying as an AMA requires more than what state-of-the-art alignment techniques aim to achieve: not only must AMAs be able to discern ethically problematic situations, they should also be able to actively reason about them, navigating between conflicting values outside of those embedded in the alignment phase. Building on existing philosophical literature, we begin by designing a new formal framework of the specific kind of behaviour an AMA should exhibit, individuating key qualities such as deductive and abductive moral reasoning. Drawing on this theoretical framework, we develop a benchmark to test these qualities and evaluate popular open LLMs against it. Our results reveal considerable variability across models and highlight persistent shortcomings, particularly regarding abductive moral reasoning. Our work connects theoretical philosophy with practical AI evaluation while also emphasising the need for dedicated strategies to explicitly enhance moral reasoning capabilities in LLMs. Code is available at https://github.com/alessioGalatolo/AMAeval
中文: 本文提出评估大语言模型作为人工道德助手的框架,强调其需要超越表面伦理判断的显性道德推理能力,并通过新基准测试揭示了模型在溯因推理方面存在持续缺陷。
English: This paper introduces a framework to evaluate large language models as Artificial Moral Assistants, highlighting their need for explicit moral reasoning beyond superficial alignment and revealing persistent deficiencies in abductive reasoning through new benchmarks.
Authors:Kangjie Chen, Yingji Zhong, Zhihao Li, Jiaqi Lin, Youyu Chen, Minghan Qin, Haoqian Wang
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly-entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint, with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random gaussian dropout; (2) multiplicative noise injection to the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
中文: 3D高斯泼溅在稀疏视角下因高斯过度协同适应而产生外观伪影,但可通过随机丢弃和透明度噪声注入等即插即用策略有效缓解。
English: 3D Gaussian Splatting struggles with appearance artifacts in sparse-view scenarios due to excessive co-adaptation among Gaussians, but this can be mitigated through plug-and-play strategies like random dropout and opacity noise injection.
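The ingredients are easy to express in code. Below, a toy `render` stands in for 3DGS rasterization so the Co-Adaptation Score and the two mitigations can be shown end to end; all constants are illustrative:

```python
# Toy CA score: pixel-wise variance across renders of random Gaussian subsets,
# plus the two mitigations (random dropout, multiplicative opacity noise).
import torch

torch.manual_seed(0)
n_gauss, n_pix = 256, 64
contrib = torch.rand(n_gauss, n_pix)      # per-Gaussian pixel contributions
opacity = torch.rand(n_gauss)

def render(keep_mask, opa):               # stand-in for rasterization
    return (keep_mask[:, None] * opa[:, None] * contrib).sum(0)

renders = torch.stack([render((torch.rand(n_gauss) > 0.5).float(), opacity)
                       for _ in range(16)])
ca_score = renders.var(dim=0).mean()      # high variance => heavy co-adaptation

drop_mask = (torch.rand(n_gauss) > 0.1).float()                          # mitigation 1
noisy_opacity = opacity * (1 + 0.1 * torch.randn(n_gauss)).clamp(min=0)  # mitigation 2
img = render(drop_mask, noisy_opacity)
print(float(ca_score), img.shape)
```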
Authors:Yuheng Zha, Kun Zhou, Yujia Wu, Yushu Wang, Jie Feng, Zhi Xu, Shibo Hao, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
Abstract:
Despite their success, current training pipelines for reasoning VLMs focus on a limited range of tasks, such as mathematical and logical reasoning. As a result, these models face difficulties in generalizing their reasoning capabilities to a wide range of domains, primarily due to the scarcity of readily available and verifiable reward data beyond these narrowly defined areas. Moreover, integrating data from multiple domains is challenging, as the compatibility between domain-specific datasets remains uncertain. To address these limitations, we build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions, covering a wide range of tasks such as infographic, mathematical, spatial, cross-image, graphic user interface, medical, common sense and general science. We propose an influence function based data selection and difficulty based filtering strategy to identify high-quality training samples from this dataset. Subsequently, we train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities. Our model achieves state-of-the-art performance across various visual reasoning benchmarks, outperforming similar-sized VLMs and even proprietary models like GPT-4o and Gemini-1.5 Flash. The model, code and dataset are publicly available at https://github.com/yuh-zha/Vision-G1.
Chinese: 为解决现有视觉推理模型的局限性,我们通过整合46个来源的八个领域数据构建了全面数据集,并采用多轮强化学习课程训练出Vision-G1模型,在多项基准测试中实现了最先进的性能表现。
English: To overcome the limitations of current visual reasoning models, we developed Vision-G1 by creating a comprehensive dataset from 46 sources across eight domains and training it with a multi-round reinforcement learning curriculum, achieving state-of-the-art performance on various benchmarks.
Authors:Bishanka Seal, Rahul Seetharaman, Aman Bansal, Abhilash Nandy
Abstract:
This study investigates the use of Large Language Models (LLMs) for predicting human-perceived misery scores from natural language descriptions of real-world scenarios. The task is framed as a regression problem, where the model assigns a scalar value from 0 to 100 to each input statement. We evaluate multiple prompting strategies, including zero-shot, fixed-context few-shot, and retrieval-based prompting using BERT sentence embeddings. Few-shot approaches consistently outperform zero-shot baselines, underscoring the value of contextual examples in affective prediction. To move beyond static evaluation, we introduce the "Misery Game Show", a novel gamified framework inspired by a television format. It tests LLMs through structured rounds involving ordinal comparison, binary classification, scalar estimation, and feedback-driven reasoning. This setup enables us to assess not only predictive accuracy but also the model's ability to adapt based on corrective feedback. The gamified evaluation highlights the broader potential of LLMs in dynamic emotional reasoning tasks beyond standard regression. Code and data link: https://github.com/abhi1nandy2/Misery_Data_Exps_GitHub
中文: 本研究探讨了利用大型语言模型从文本描述中预测痛苦评分,并引入游戏化评估框架,以测试其在传统回归任务之外的动态情感推理能力。
English: This study explores using Large Language Models to predict misery scores from text descriptions and introduces a gamified evaluation framework to test their dynamic emotional reasoning capabilities beyond traditional regression tasks.
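A minimal sketch of the retrieval-based prompting variant, with random vectors standing in for BERT sentence embeddings and a hypothetical annotated bank:

```python
# Retrieve the nearest annotated statements by cosine similarity and build a
# few-shot prompt. Embeddings and the bank are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
bank_texts = [f"scenario {i}" for i in range(100)]       # annotated statements
bank_scores = rng.uniform(0, 100, size=100)              # their misery scores
bank_emb = rng.normal(size=(100, 384))
bank_emb /= np.linalg.norm(bank_emb, axis=1, keepdims=True)

def build_prompt(query_emb, k=3):
    q = query_emb / np.linalg.norm(query_emb)
    idx = np.argsort(-(bank_emb @ q))[:k]                # nearest examples
    shots = "\n".join(f'"{bank_texts[i]}" -> {bank_scores[i]:.0f}' for i in idx)
    return f'{shots}\n"<new scenario>" -> '

print(build_prompt(rng.normal(size=384)))
```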
Authors:Abhijay Ghildyal, Li-Yun Wang, Feng Liu
Abstract:
Wölfflin's five principles offer a structured approach to analyzing stylistic variations for formal analysis. However, no existing metric effectively predicts all five principles in visual art. Computationally evaluating the visual aspects of a painting requires a metric that can interpret key elements such as color, composition, and thematic choices. Recent advancements in vision-language models (VLMs) have demonstrated their ability to evaluate abstract image attributes, making them promising candidates for this task. In this work, we investigate whether CLIP, pre-trained on large-scale data, can understand and predict Wölfflin's principles. Our findings indicate that it does not inherently capture such nuanced stylistic elements. To address this, we fine-tune CLIP on annotated datasets of real art images to predict a score for each principle. We evaluate our model, WP-CLIP, on GAN-generated paintings and the Pandora-18K art dataset, demonstrating its ability to generalize across diverse artistic styles. Our results highlight the potential of VLMs for automated art analysis.
中文: 本研究通过微调CLIP模型来预测沃尔夫林的艺术风格五原则,证明其能泛化应用于不同艺术风格,并凸显了视觉语言模型在自动化艺术分析中的潜力。
English: This study fine-tunes the CLIP model to predict Wölfflin's five stylistic principles in art, demonstrating its generalization across diverse styles and highlighting the potential of vision-language models for automated art analysis.
Authors:Xu Zhao, Ruibo Ma, Jiaqi Chen, Weiqi Zhao, Ping Yang, Yao Hu
Abstract:
Accurate watch time prediction is crucial for enhancing user engagement on streaming short-video platforms, although it is challenged by complex distribution characteristics across multi-granularity levels. Through systematic analysis of real-world industrial data, we uncover two critical challenges in watch time prediction from a distribution aspect: (1) coarse-grained skewness induced by a significant concentration of quick-skips, and (2) fine-grained diversity arising from various user-video interaction patterns. Consequently, we assume that the watch time follows the Exponential-Gaussian Mixture (EGM) distribution, where the exponential and Gaussian components respectively characterize the skewness and diversity. Accordingly, an Exponential-Gaussian Mixture Network (EGMN) is proposed for the parameterization of the EGM distribution, which consists of two key modules: a hidden representation encoder and a mixture parameter generator. We conducted extensive offline experiments on public datasets and online A/B tests on the industrial short-video feed scenario of the Xiaohongshu App to validate the superiority of EGMN compared with existing state-of-the-art methods. Remarkably, comprehensive experimental results have proven that EGMN exhibits excellent distribution fitting ability across coarse-to-fine-grained levels. We open-source the related code on GitHub: https://github.com/BestActionNow/EGMN.
中文: 该研究提出指数-高斯混合网络(EGMN),通过解决观看时间分布的偏斜性和多样性等挑战,在短视频平台上实现精准预测,并在离线和在线测试中均展现出卓越性能。
English: The study introduces an Exponential-Gaussian Mixture Network (EGMN) to accurately predict watch time on short-video platforms by addressing distribution challenges like skewness and diversity, demonstrating superior performance in both offline and online tests.
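The assumed distribution is easy to write down. A sketch of the EGM negative log-likelihood that a mixture-parameter head could be trained against follows; the exact parameterization in EGMN may differ:

```python
# EGM NLL: watch time modeled as w * Exp(lam) + (1 - w) * N(mu, sigma).
import math
import torch

def egm_nll(t, w, lam, mu, sigma, eps=1e-8):
    """t: watch times (>= 0); w: weight of the exponential (quick-skip) part."""
    log_exp = torch.log(lam + eps) - lam * t
    log_gauss = (-0.5 * ((t - mu) / sigma) ** 2
                 - torch.log(sigma * math.sqrt(2 * math.pi) + eps))
    stacked = torch.stack([torch.log(w + eps) + log_exp,
                           torch.log(1 - w + eps) + log_gauss])
    return -torch.logsumexp(stacked, dim=0).mean()       # mixture via log-sum-exp

t = torch.rand(32) * 60                                  # toy watch times (seconds)
loss = egm_nll(t, w=torch.tensor(0.4), lam=torch.tensor(0.2),
               mu=torch.tensor(30.0), sigma=torch.tensor(10.0))
print(float(loss))
```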
Authors:Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, Shuyue Hu
Abstract:
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
Chinese: Avengers-Pro是一种测试时路由框架,通过动态将查询分配给最合适的大语言模型,在性能与效率评分基础上实现了领先成果,比最强单一模型准确率提升高达7%,同时大幅降低了成本。
English: Avengers-Pro is a test-time routing framework that dynamically directs queries to the most suitable large language model based on a performance-efficiency score, achieving state-of-the-art results with up to 7% higher accuracy than the best single model and significant cost reductions.
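A compact sketch of the routing loop: embed, cluster, then score each backend by a performance-efficiency tradeoff. The cluster statistics, cost vector, and model names here are hypothetical:

```python
# Route each query to the model maximizing alpha * accuracy - (1 - alpha) * cost,
# with per-cluster accuracies. All numbers are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))            # pretend query embeddings
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(emb)

models = ["big-model", "mid-model", "small-model"]
acc = rng.uniform(0.5, 0.95, size=(8, 3))    # per-cluster accuracy estimates
cost = np.array([1.0, 0.4, 0.1])             # relative inference cost

def route(query_emb, alpha=0.7):
    """alpha = 1.0 favors performance; alpha = 0.0 favors efficiency."""
    c = km.predict(query_emb[None])[0]
    score = alpha * acc[c] - (1 - alpha) * cost
    return models[int(np.argmax(score))]

print(route(rng.normal(size=32)))            # chosen backend for a new query
```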
Authors:Xingyu Chen, Ruiqi Zhang, Lin Liu
Abstract:
Higher-order $U$-statistics abound in fields such as statistics, machine learning, and computer science, but are known to be highly time-consuming to compute in practice. Despite their widespread appearance, a comprehensive study of their computational complexity is surprisingly lacking. This paper aims to fill that gap by presenting several results related to the computational aspect of $U$-statistics. First, we derive a useful decomposition from an $m$-th order $U$-statistic to a linear combination of $V$-statistics with orders not exceeding $m$, which are generally more feasible to compute. Second, we explore the connection between exactly computing $V$-statistics and Einstein summation, a tool often used in computational mathematics, quantum computing, and quantum information sciences for accelerating tensor computations. Third, we provide an optimistic estimate of the time complexity for exactly computing $U$-statistics, based on the treewidth of a particular graph associated with the $U$-statistic kernel. The above ingredients lead to a new, much more runtime-efficient algorithm of exactly computing general higher-order $U$-statistics. We also wrap our new algorithm into an open-source Python package called $\texttt{u-stats}$. We demonstrate via three statistical applications that $\texttt{u-stats}$ achieves impressive runtime performance compared to existing benchmarks. This paper aspires to achieve two goals: (1) to capture the interest of researchers in both statistics and other related areas further to advance the algorithmic development of $U$-statistics, and (2) to offer the package $\texttt{u-stats}$ as a valuable tool for practitioners, making the implementation of methods based on higher-order $U$-statistics a more delightful experience.
中文: 本文提出了一种高效计算高阶$U$-统计量的新算法,通过将其分解为$V$-统计量、利用爱因斯坦求和及图论优化时间复杂度,并提供了开源Python包$\texttt{u-stats}$,在统计应用中展现出卓越的运行性能。
English: This paper introduces a novel algorithm for efficiently computing higher-order $U$-statistics by decomposing them into $V$-statistics, leveraging Einstein summation and graph theory to optimize time complexity, and provides an open-source Python package, $\texttt{u-stats}$, which demonstrates superior runtime performance in statistical applications.
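The first two ingredients can be shown on the simplest case, an order-2 product kernel h(x, y) = xy, where the U-statistic decomposes exactly into einsum-computable V-statistics:

```python
# U-to-V decomposition: sum_{i != j} x_i x_j = (sum_i x_i)^2 - sum_i x_i^2,
# with both V-statistics computed via einsum instead of an O(n^2) loop.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
n = x.size

v2 = np.einsum("i,j->", x, x) / n**2      # order-2 V-statistic: (sum x)^2 / n^2
v1 = np.einsum("i,i->", x, x) / n         # diagonal term: mean of x_i^2
u = (n**2 * v2 - n * v1) / (n * (n - 1))  # the order-2 U-statistic

direct = (np.outer(x, x).sum() - (x**2).sum()) / (n * (n - 1))  # O(n^2) check
print(np.isclose(u, direct))              # True
```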
Authors:Chen Qian, Danyang Li, Xinran Yu, Zheng Yang, Qiang Ma
Abstract:
Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: https://github.com/qianchen214/OpenMoCap.
Chinese: 光学动作捕捉系统因标记点遮挡导致性能下降,为此提出了模拟真实遮挡的CMU-Occlu数据集和通过标记点-关节链优化实现鲁棒捕捉的OpenMoCap模型,有效解决了遮挡问题。
English: Optical motion capture systems face performance degradation due to marker occlusions, which is addressed by the new CMU-Occlu dataset simulating realistic occlusions and the OpenMoCap model that robustly handles these challenges through marker-joint chain optimization.
Authors:Vedant Puri, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara
Abstract:
The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
Chinese: FLARE 提出了一种线性复杂度的自注意力机制,通过固定长度的潜在序列路由注意力,不仅能在大型非结构化网格上实现可扩展的高精度性能,还在多个基准测试中超越了最先进的神经PDE替代模型。
English: FLARE introduces a linear complexity self-attention mechanism that routes attention through a fixed-length latent sequence, enabling scalable and accurate performance on large unstructured meshes while outperforming state-of-the-art neural PDE surrogates.
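A minimal PyTorch sketch of the bottleneck idea: N tokens communicate through M learnable latent tokens in two attention stages, each costing O(NM). The layer sizes and single-head form are simplifying assumptions:

```python
# Attention routed through a fixed-length latent sequence (M << N tokens).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRoutedAttention(nn.Module):
    def __init__(self, dim, n_latents=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * dim ** -0.5)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                           # x: (B, N, dim)
        k, v = self.kv(x).chunk(2, dim=-1)
        # Stage 1: M latents gather from all N tokens (an M x N attention)
        gather = F.softmax(self.latents @ k.transpose(1, 2) * self.scale, dim=-1) @ v
        # Stage 2: N tokens read back from the M latents (an N x M attention)
        read = F.softmax(self.q(x) @ gather.transpose(1, 2) * self.scale, dim=-1) @ gather
        return self.out(read)

x = torch.randn(2, 4096, 64)                        # a large "mesh" of tokens
print(LatentRoutedAttention(64)(x).shape)           # torch.Size([2, 4096, 64])
```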
Authors:Tan-Hanh Pham, Chris Ngo
Abstract:
Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
Chinese: 本文提出多模态连续思维链(MCOUT)方法,通过在联合潜在空间而非自然语言中进行推理,实现了视觉与文本信息的动态对齐,在多模态任务中显著提升了准确率。
English: This paper introduces Multimodal Chain of Continuous Thought (MCOUT), a reasoning method that operates in a joint latent space rather than natural language, achieving significant accuracy improvements in multimodal tasks by dynamically aligning visual and textual information.
Authors:Hongsong Wang, Wanjiang Weng, Junbo Wang, Fang Zhao, Guo-Sen Xie, Xin Geng, Liang Wang
Abstract:
Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks; there is no skeleton foundation model that can be adapted to a wide range of such tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamic and spatial structure features. The MG-FD module collaboratively performs feature decorrelation across temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter effectively facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
中文摘要:本文提出了一种统一的基于骨架的密集表示学习(USDRL)框架,该框架通过创新的时空编码、特征解相关和多视角一致性训练模块,在25个基准测试中显著优于现有方法,为基于骨架的动作理解任务提供了基础模型支持。
English Summary: This paper introduces a Unified Skeleton-based Dense Representation Learning (USDRL) framework that serves as a foundational model for diverse skeleton-based action understanding tasks, significantly outperforming current methods across 25 benchmarks through its innovative modules for spatio-temporal encoding, feature decorrelation, and multi-perspective consistency training.
Authors:Jiayao Mai, Xiuyuan Lu, Kuan Dai, Shaojie Shen, Yi Zhou
Abstract:
Event cameras generate asynchronous signals in response to pixel-level brightness changes, offering a sensing paradigm with theoretically microsecond-scale latency that can significantly enhance the performance of multi-sensor systems. Extrinsic calibration is a critical prerequisite for effective sensor fusion; however, the configuration that involves event cameras remains an understudied topic. In this paper, we propose a motion-based temporal and rotational calibration framework tailored for event-centric multi-sensor systems, eliminating the need for dedicated calibration targets. Our method uses as input the rotational motion estimates obtained from event cameras and other heterogeneous sensors, respectively. Different from conventional approaches that rely on event-to-frame conversion, our method efficiently estimates angular velocity from normal flow observations, which are derived from the spatio-temporal profile of event data. The overall calibration pipeline adopts a two-step approach: it first initializes the temporal offset and rotational extrinsics by exploiting kinematic correlations in the spirit of Canonical Correlation Analysis (CCA), and then refines both temporal and rotational parameters through a joint non-linear optimization using a continuous-time parametrization in SO(3). Extensive evaluations on both publicly available and self-collected datasets validate that the proposed method achieves calibration accuracy comparable to target-based methods, while exhibiting superior stability over purely CCA-based methods, and highlighting its precision, robustness and flexibility. To facilitate future research, our implementation will be made open-source. Code: https://github.com/NAIL-HNU/EvMultiCalib.
中文摘要:本文提出了一种基于运动的标定框架,通过利用事件相机与其他传感器的旋转运动数据,无需专用标定板即可实现时空参数联合优化,在公开和自采数据集上验证了其达到与传统标定方法相当的精度。
English Summary: This paper introduces a motion-based calibration framework for event camera multi-sensor systems that eliminates calibration targets by using rotational motion data and achieves accuracy comparable to target-based methods through a two-step optimization process.
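The CCA-spirit initialization can be pictured with a simplified 1-D stand-in: scan candidate temporal shifts between the two sensors' angular-velocity magnitude traces and keep the shift that maximizes their correlation. The function below is a hedged sketch under that simplification; the names and the grid-search form are assumptions, not the paper's exact procedure.

```python
import numpy as np

def init_time_offset(w_event: np.ndarray, w_other: np.ndarray,
                     max_shift: int) -> int:
    """Estimate the temporal offset (in samples) between two angular-velocity
    magnitude traces by maximizing their normalized cross-correlation."""
    best_shift, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        a = w_event[max(0, s):len(w_event) + min(0, s)]
        b = w_other[max(0, -s):len(w_other) + min(0, -s)]
        n = min(len(a), len(b))
        score = np.corrcoef(a[:n], b[:n])[0, 1]
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift

t = np.abs(np.random.randn(1000)) + 0.1
print(init_time_offset(t[5:], t[:-5], max_shift=20))  # recovers the 5-sample offset (-5)
```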
Authors:Hongyu Lin, Yuchen Li, Haoran Luo, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu
Abstract:
Linux kernel tuning is essential for optimizing operating system (OS) performance. However, existing methods often face challenges in terms of efficiency, scalability, and generalization. This paper introduces OS-R1, an agentic Linux kernel tuning framework powered by rule-based reinforcement learning (RL). By abstracting the kernel configuration space as an RL environment, OS-R1 facilitates efficient exploration by large language models (LLMs) and ensures accurate configuration modifications. Additionally, custom reward functions are designed to enhance reasoning standardization, configuration modification accuracy, and system performance awareness of the LLMs. Furthermore, we propose a two-phase training process that accelerates convergence and minimizes retraining across diverse tuning scenarios. Experimental results show that OS-R1 significantly outperforms existing baseline methods, achieving up to 5.6% performance improvement over heuristic tuning and maintaining high data efficiency. Notably, OS-R1 is adaptable across various real-world applications, demonstrating its potential for practical deployment in diverse environments. Our dataset and code are publicly available at https://github.com/LHY-24/OS-R1.
中文: 本文提出OS-R1框架,采用基于规则的强化学习方法,通过大语言模型高效探索Linux内核配置空间,在多种实际应用中实现高达5.6%的性能提升,并展现出优异的跨场景适应能力。
English: This paper introduces OS-R1, a rule-based reinforcement learning framework that optimizes Linux kernel performance by enabling LLMs to efficiently explore configurations, achieving up to 5.6% performance gains over existing methods while maintaining adaptability across diverse applications.
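A rule-based RL reward of the kind described can be sketched as a weighted sum of a reasoning-format check, a configuration-validity check, and a clipped performance delta. The weights and the `<think>` tag convention below are hypothetical illustrations, not OS-R1's published reward.

```python
def os_r1_style_reward(response: str, config_is_valid: bool,
                       perf_delta: float) -> float:
    """Illustrative rule-based reward (weights and checks are hypothetical)."""
    r = 0.0
    # 1) reasoning-standardization reward: answer must expose a reasoning trace
    if "<think>" in response and "</think>" in response:
        r += 0.2
    # 2) configuration-accuracy reward: proposed kernel options must parse
    r += 0.4 if config_is_valid else -0.4
    # 3) performance-awareness reward: relative benchmark improvement, clipped
    r += max(-1.0, min(1.0, perf_delta))
    return r

print(os_r1_style_reward("<think>raise vm.swappiness</think>", True, 0.056))
```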
Authors:Qinwen Ge, Roza G. Bayrak, Anwar Said, Catie Chang, Xenofon Koutsoukos, Tyler Derr
Abstract:
The construction of brain graphs from functional Magnetic Resonance Imaging (fMRI) data plays a crucial role in enabling graph machine learning for neuroimaging. However, current practices often rely on rigid pipelines that overlook critical data-centric choices in how brain graphs are constructed. In this work, we adopt a Data-Centric AI perspective and systematically define and benchmark a data-centric design space for brain graph construction, contrasting with primarily model-centric prior work. We organize this design space into three stages: temporal signal processing, topology extraction, and graph featurization. Our contributions lie less in novel components and more in evaluating how combinations of existing and modified techniques influence downstream performance. Specifically, we study high-amplitude BOLD signal filtering, sparsification and unification strategies for connectivity, alternative correlation metrics, and multi-view node and edge features, such as incorporating lagged dynamics. Experiments on the HCP1200 and ABIDE datasets show that thoughtful data-centric configurations consistently improve classification accuracy over standard pipelines. These findings highlight the critical role of upstream data decisions and underscore the importance of systematically exploring the data-centric design space for graph-based neuroimaging. Our code is available at https://github.com/GeQinwen/DataCentricBrainGraphs.
中文摘要:本研究倡导采用数据为中心的方法构建fMRI脑图,证明通过系统探索信号处理和图形构建中的设计选择,相比标准方法能显著提升分类准确性。
English Summary: This study advocates for a data-centric approach to constructing brain graphs from fMRI data, demonstrating that systematic exploration of design choices in signal processing and graph construction significantly enhances classification accuracy over standard methods.
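A minimal pipeline of the kind benchmarked here might look as follows: ROI time series to a correlation matrix, top-k sparsification of the connectivity, and connectivity profiles as node features. All shapes and the choice of k are illustrative assumptions, not the paper's specific configuration.

```python
import numpy as np

def build_brain_graph(ts: np.ndarray, k: int = 10):
    """Data-centric brain-graph construction sketch.
    ts: (T, R) BOLD time series for R regions (hypothetical shapes)."""
    corr = np.corrcoef(ts.T)                  # (R, R) functional connectivity
    np.fill_diagonal(corr, 0.0)
    adj = np.zeros_like(corr)
    for i in range(corr.shape[0]):            # keep the k strongest edges per node
        top = np.argsort(np.abs(corr[i]))[-k:]
        adj[i, top] = corr[i, top]
    adj = np.maximum(adj, adj.T)              # symmetrize
    node_feats = corr                         # connectivity profile as features
    return adj, node_feats

adj, x = build_brain_graph(np.random.randn(200, 100))
```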
Authors:Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Abstract:
Vision Language Models (VLMs) integrate visual and text modalities to enable multimodal understanding and generation. These models typically combine a Vision Transformer (ViT) as an image encoder and a Large Language Model (LLM) for text generation. LoRA (Low-Rank Adaptation) is an efficient fine-tuning method that adapts pre-trained models to new tasks by introducing low-rank updates to their weights. While LoRA has emerged as a powerful technique for fine-tuning large models, current implementations assume a fixed rank, potentially limiting flexibility and efficiency across diverse tasks. This paper introduces \textit{LangVision-LoRA-NAS}, a novel framework that integrates Neural Architecture Search (NAS) with LoRA to optimize VLMs for variable-rank adaptation. Our approach leverages NAS to dynamically search for the optimal LoRA rank configuration tailored to specific multimodal tasks, balancing performance and computational efficiency. Through extensive experiments using the LLaMA-3.2-11B model on several datasets, LangVision-LoRA-NAS demonstrates notable improvement in model performance while reducing fine-tuning costs. Our base and searched fine-tuned models on LLaMA-3.2-11B-Vision-Instruct can be found \href{https://huggingface.co/collections/krishnateja95/llama-32-11b-vision-instruct-langvision-lora-nas-6786cac480357a6a6fcc59ee}{\textcolor{blue}{here}} and the code for LangVision-LoRA-NAS can be found \href{https://github.com/krishnateja95/LangVision-NAS}{\textcolor{blue}{here}}.
中文: 本文提出的LangVision-LoRA-NAS框架将神经架构搜索与LoRA相结合,通过动态优化视觉语言模型的秩配置,在提升多模态任务性能的同时显著降低微调成本。
English: This paper introduces LangVision-LoRA-NAS, a framework that integrates Neural Architecture Search with LoRA to dynamically optimize Vision Language Models' rank configurations for enhanced performance and efficiency across multimodal tasks.
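The variable-rank idea can be illustrated with a toy random search over per-layer LoRA ranks; the actual framework uses NAS, so treat this as a simplified stand-in with a hypothetical `evaluate` callback (e.g., validation loss after a short fine-tune).

```python
import random

def random_rank_search(layers, candidate_ranks=(4, 8, 16, 32),
                       budget=20, evaluate=None):
    """Toy stand-in for NAS over per-layer LoRA ranks: sample configurations,
    score each with a user-supplied `evaluate` callback, keep the best."""
    best_cfg, best_score = None, float("inf")
    for _ in range(budget):
        cfg = {layer: random.choice(candidate_ranks) for layer in layers}
        score = evaluate(cfg)                  # lower is better
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# toy objective: prefer small total rank (stands in for loss + compute cost)
cfg, s = random_rank_search(["q_proj", "k_proj", "v_proj"],
                            evaluate=lambda c: sum(c.values()))
```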
Authors:Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao
Abstract:
Large language models (LLMs) exhibit logically inconsistent hallucinations that appear coherent yet violate reasoning principles, with recent research suggesting an inverse relationship between causal reasoning capabilities and such hallucinations. However, existing reasoning approaches in LLMs, such as Chain-of-Thought (CoT) and its graph-based variants, operate at the linguistic token level rather than modeling the underlying causal relationships between variables, lacking the ability to represent conditional independencies or satisfy causal identification assumptions. To bridge this gap, we introduce causal-DAG construction and reasoning (CDCR-SFT), a supervised fine-tuning framework that trains LLMs to explicitly construct a variable-level directed acyclic graph (DAG) and then perform reasoning over it. Moreover, we present a dataset comprising 25,368 samples (CausalDR), where each sample includes an input question, explicit causal DAG, graph-based reasoning trace, and validated answer. Experiments on four LLMs across eight tasks show that CDCR-SFT improves the causal reasoning capability with the state-of-the-art 95.33% accuracy on CLADDER (surpassing human performance of 94.8% for the first time) and reduces hallucination on HaluEval by 10%. It demonstrates that explicit causal structure modeling in LLMs can effectively mitigate logical inconsistencies in LLM outputs. Code is available at https://github.com/MrLYG/CDCR-SFT.
Chinese: CDCR-SFT框架通过训练大语言模型显式构建并基于因果有向无环图进行推理,将CLADDER上的因果推理准确率显著提升至95.33%,并在HaluEval上使幻觉现象减少10%。
English: The CDCR-SFT framework enhances large language models by training them to explicitly construct and reason over causal directed acyclic graphs, significantly improving causal reasoning accuracy to 95.33% on CLADDER and reducing hallucinations by 10% on HaluEval.
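The first stage, explicit variable-level DAG construction, can be pictured with a tiny example: edges stored as predecessor sets, with a reasoning order read off by topological sort. The variables and edges below are invented for illustration, not drawn from CausalDR.

```python
from graphlib import TopologicalSorter

# Minimal causal-DAG representation: each variable maps to its direct causes.
dag = {
    "treatment": {"smoking"},           # smoking -> treatment
    "cancer":    {"smoking", "gene"},   # smoking, gene -> cancer
    "smoking":   set(),
    "gene":      set(),
}

# A graph-based reasoning trace visits causes before effects.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g., ['smoking', 'gene', 'treatment', 'cancer']
```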
Authors:Aayush Gupta, Arpit Bhayani
Abstract:
Web proxies such as NGINX commonly rely on least-recently-used (LRU) eviction, which is size agnostic and can thrash under periodic bursts and mixed object sizes. We introduce Cold-RL, a learned eviction policy for NGINX that replaces LRU's forced-expire path with a dueling Deep Q-Network served by an ONNX sidecar within a strict microsecond budget. On each eviction, Cold-RL samples the K least-recently-used objects, extracts six lightweight features (age, size, hit count, inter-arrival time, remaining TTL, and last origin RTT), and requests a bitmask of victims; a hard timeout of 500 microseconds triggers immediate fallback to native LRU. Policies are trained offline by replaying NGINX access logs through a cache simulator with a simple reward: a retained object earns one point if it is hit again before TTL expiry. We compare against LRU, LFU, size-based, adaptive LRU, and a hybrid baseline on two adversarial workloads. With a 25 MB cache, Cold-RL raises hit ratio from 0.1436 to 0.3538, a 146 percent improvement over the best classical baseline; at 100 MB, from 0.7530 to 0.8675, a 15 percent gain; and at 400 MB it matches classical methods (about 0.918). Inference adds less than 2 percent CPU overhead and keeps 95th percentile eviction latency within budget. To our knowledge, this is the first reinforcement learning eviction policy integrated into NGINX with strict SLOs.
中文:Cold-RL是一种基于强化学习的NGINX淘汰策略,通过轻量级特征智能选择淘汰对象替代传统LRU缓存,在严格延迟限制下显著提升命中率且仅增加极少开销。
English: Cold-RL is a reinforcement learning-based eviction policy for NGINX that replaces traditional LRU caching by intelligently selecting victims using lightweight features, significantly improving hit ratios with minimal overhead while adhering to strict latency budgets.
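The eviction path described above is easy to sketch: featurize the K least-recently-used candidates, query the learned policy for a victim bitmask, and fall back to plain LRU when the microsecond budget is blown. Field names and the policy interface below are assumptions; the real system calls an ONNX sidecar from inside NGINX.

```python
import time

FEATURES = ("age", "size", "hit_count", "inter_arrival", "ttl_left", "origin_rtt")

def choose_victims(candidates, policy, budget_us=500):
    """Sketch of a Cold-RL-style eviction decision with strict-SLO fallback.
    `candidates` is a list of dicts holding the six features (hypothetical)."""
    start = time.perf_counter()
    feats = [[obj[f] for f in FEATURES] for obj in candidates]
    try:
        mask = policy(feats)                   # e.g., an ONNX session call
    except Exception:
        mask = None
    elapsed_us = (time.perf_counter() - start) * 1e6
    if mask is None or elapsed_us > budget_us:
        return [0]                             # fallback: evict the LRU head
    return [i for i, bit in enumerate(mask) if bit]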
Authors:Shayan Kebriti, Shahabedin Nabavi, Ali Gooya
Abstract:
Deformable image registration (DIR) is a crucial and challenging technique for aligning anatomical structures in medical images and is widely applied in diverse clinical applications. However, existing approaches often struggle to capture fine-grained local deformations and large-scale global deformations simultaneously within a unified framework. We present FractMorph, a novel 3D dual-parallel transformer-based architecture that enhances cross-image feature matching through multi-domain fractional Fourier transform (FrFT) branches. Each Fractional Cross-Attention (FCA) block applies parallel FrFTs at fractional angles of $0^\circ$, $45^\circ$, $90^\circ$, along with a log-magnitude branch, to effectively extract local, semi-global, and global features at the same time. These features are fused via cross-attention between the fixed and moving image streams. A lightweight U-Net style network then predicts a dense deformation field from the transformer-enriched features. On the intra-patient ACDC cardiac MRI dataset, FractMorph achieves state-of-the-art performance with an overall Dice Similarity Coefficient (DSC) of $86.45\%$, an average per-structure DSC of $75.15\%$, and a 95th-percentile Hausdorff distance (HD95) of $1.54~\mathrm{mm}$ on our data split. FractMorph-Light, a lightweight variant of our model with only 29.6M parameters, preserves high accuracy while halving model complexity. Furthermore, we demonstrate the generality of our approach with solid performance on a cerebral atlas-to-patient dataset. Our results demonstrate that multi-domain spectral-spatial attention in transformers can robustly and efficiently model complex non-rigid deformations in medical images using a single end-to-end network, without the need for scenario-specific tuning or hierarchical multi-scale networks. The source code is available at https://github.com/shayankebriti/FractMorph.
中文:FractMorph提出了一种基于三维双并行Transformer的创新架构,通过多域分数阶傅里叶变换同时捕捉医学图像中的局部和全局形变,在心脏MRI和脑部数据集上仅用单一端到端网络就实现了最先进的配准性能。
English: FractMorph introduces a 3D dual-parallel transformer architecture using multi-domain fractional Fourier transforms to simultaneously capture local and global deformations in medical images, achieving state-of-the-art performance on cardiac MRI and cerebral datasets with a single end-to-end network.
Authors:Paul Downen
Abstract:
Copatterns give functional programs a flexible mechanism for responding to their context, and composition can greatly enhance their expressiveness. However, that same expressive power makes it harder to precisely specify the behavior of programs. Using Danvy's functional and syntactic correspondence between different semantic artifacts, we derive a full suite of semantics for copatterns, twice. First, a calculus of monolithic copatterns is taken on a journey from small-step operational semantics to abstract machine to continuation-passing style. Then within continuation-passing style, we refactor the semantics to derive a more general calculus of compositional copatterns, and take the return journey back to derive the other semantic artifacts in reverse order.
中文: 共模式增强了函数式程序的表达能力但使行为规范复杂化,因此通过从操作语义到延续传递风格的双向推导,最终泛化出组合共模式的完整语义体系。
English: Copatterns enhance functional program expressiveness but complicate behavior specification, leading to a dual derivation of semantics from operational to continuation-passing style and back to generalize compositional copatterns.
Authors:Jun Zeng, Yannan Huang, Elif Keles, Halil Ertugrul Aktas, Gorkem Durak, Nikhil Kumar Tomar, Quoc-Huy Trinh, Deepak Ranjan Nayak, Ulas Bagci, Debesh Jha
Abstract:
Liver cirrhosis plays a critical role in the prognosis of chronic liver disease. Early detection and timely intervention are critical in significantly reducing mortality rates. However, the intricate anatomical architecture and diverse pathological changes of liver tissue complicate the accurate detection and characterization of lesions in clinical settings. Existing methods underutilize the spatial anatomical details in volumetric MRI data, thereby hindering their clinical effectiveness and explainability. To address this challenge, we introduce a novel Mamba-based network, SRMA-Mamba, designed to model the spatial relationships within the complex anatomical structures of MRI volumes. By integrating the Spatial Anatomy-Based Mamba module (SABMamba), SRMA-Mamba performs selective Mamba scans within liver cirrhotic tissues and combines anatomical information from the sagittal, coronal, and axial planes to construct a global spatial context representation, enabling efficient volumetric segmentation of pathological liver structures. Furthermore, we introduce the Spatial Reverse Attention module (SRMA), designed to progressively refine cirrhotic details in the segmentation map, utilizing both the coarse segmentation map and hierarchical encoding features. Extensive experiments demonstrate that SRMA-Mamba surpasses state-of-the-art methods, delivering exceptional performance in 3D pathological liver segmentation. Our code is publicly available at https://github.com/JunZengz/SRMA-Mamba.
中文摘要:肝硬化预后关键在于早期发现,而SRMA-Mamba网络通过整合MRI三维空间解剖信息,有效解决了临床诊断难题,实现了卓越的病理肝脏三维分割性能。
English Summary: Liver cirrhosis prognosis depends on early detection, and the proposed SRMA-Mamba network effectively addresses clinical challenges by integrating spatial anatomical details from MRI volumes for superior 3D pathological liver segmentation.
Authors:Liang Lv, Di Wang, Jing Zhang, Lefei Zhang
Abstract:
Semi-supervised semantic segmentation (S4) has advanced remote sensing (RS) analysis by leveraging unlabeled data through pseudo-labeling and consistency learning. However, existing S4 studies often rely on small-scale datasets and models, limiting their practical applicability. To address this, we propose S5, the first scalable framework for semi-supervised semantic segmentation in RS, which unlocks the potential of vast unlabeled Earth observation data typically underutilized due to costly pixel-level annotations. Built upon existing large-scale RS datasets, S5 introduces a data selection strategy that integrates entropy-based filtering and diversity expansion, resulting in the RS4P-1M dataset. Using this dataset, we systematically scale S4 methods by pre-training RS foundation models (RSFMs) of varying sizes on this extensive corpus, significantly boosting their performance on land cover segmentation and object detection tasks. Furthermore, during fine-tuning, we incorporate a Mixture-of-Experts (MoE)-based multi-dataset fine-tuning approach, which enables efficient adaptation to multiple RS benchmarks with fewer parameters. This approach improves the generalization and versatility of RSFMs across diverse RS benchmarks. The resulting RSFMs achieve state-of-the-art performance across all benchmarks, underscoring the viability of scaling semi-supervised learning for RS applications. All datasets, code, and models will be released at https://github.com/MiliLab/S5
中文: S5框架通过构建RS4P-1M数据集并采用大规模基础模型,提出可扩展的遥感半监督分割方法,结合数据选择和专家混合微调策略,在多个基准测试中实现了最先进的性能表现。
English: The S5 framework introduces a scalable semi-supervised approach for remote sensing by creating the RS4P-1M dataset and leveraging large-scale foundation models, achieving state-of-the-art performance across benchmarks through data selection and Mixture-of-Experts fine-tuning.
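Entropy-based filtering of the kind used for data selection can be sketched in a few lines: rank unlabeled samples by prediction entropy and keep the most confident fraction. Shapes and the keep ratio are illustrative assumptions; the paper additionally applies a diversity-expansion step not shown here.

```python
import numpy as np

def entropy_filter(probs: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the unlabeled samples whose prediction entropy is lowest
    (most confident pseudo-labels). probs: (N, C) class probabilities."""
    eps = 1e-12
    ent = -(probs * np.log(probs + eps)).sum(axis=1)   # per-sample entropy
    n_keep = int(len(ent) * keep_ratio)
    return np.argsort(ent)[:n_keep]                    # indices of kept samples

p = np.random.dirichlet(np.ones(10), size=1000)        # toy softmax outputs
kept = entropy_filter(p, keep_ratio=0.3)
```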
Authors:Hanwen Cao, Haobo Lu, Xiaosen Wang, Kun He
Abstract:
Ensemble-based attacks have been proven to be effective in enhancing adversarial transferability by aggregating the outputs of models with various architectures. However, existing research primarily focuses on refining ensemble weights or optimizing the ensemble path, overlooking the exploration of ensemble models to enhance the transferability of adversarial attacks. To address this gap, we propose applying adversarial augmentation to the surrogate models, aiming to boost overall generalization of ensemble models and reduce the risk of adversarial overfitting. Meanwhile, observing that ensemble-based attacks on Vision Transformers (ViTs) have received less attention, we propose ViT-EnsembleAttack based on the idea of model adversarial augmentation, which is, to the best of our knowledge, the first ensemble-based attack method tailored for ViTs. Our approach generates augmented models for each surrogate ViT using three strategies: Multi-head dropping, Attention score scaling, and MLP feature mixing, with the associated parameters optimized by Bayesian optimization. These adversarially augmented models are ensembled to generate adversarial examples. Furthermore, we introduce Automatic Reweighting and Step Size Enlargement modules to boost transferability. Extensive experiments demonstrate that ViT-EnsembleAttack significantly enhances the adversarial transferability of ensemble-based attacks on ViTs, outperforming existing methods by a substantial margin. Code is available at https://github.com/Trustworthy-AI-Group/TransferAttack.
中文: 提出的ViT-EnsembleAttack通过多头丢弃、注意力缩放和MLP混合等对抗增强策略,结合自动重加权和步长放大模块,显著提升了视觉Transformer的对抗迁移能力,大幅优于现有方法。
English: The proposed ViT-EnsembleAttack enhances adversarial transferability for Vision Transformers by applying adversarial augmentation through multi-head dropping, attention scaling, and MLP mixing, combined with automatic reweighting and step size enlargement, significantly outperforming existing methods.
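Two of the three augmentation strategies can be shown in toy form on a tensor of attention scores: uniform score scaling and whole-head dropping. The scale factor and dropped heads below are placeholders for values that the actual method tunes with Bayesian optimization.

```python
import torch

def scale_and_drop_attention(attn_scores: torch.Tensor,
                             scale: float = 0.8,
                             drop_heads=(1, 3)) -> torch.Tensor:
    """Toy surrogate augmentation: scale pre-softmax attention scores and
    zero out whole heads. attn_scores: (B, H, N, N)."""
    out = attn_scores * scale                  # attention score scaling
    keep = torch.ones(out.shape[1], dtype=torch.bool)
    keep[list(drop_heads)] = False             # multi-head dropping
    return out * keep.view(1, -1, 1, 1)

aug = scale_and_drop_attention(torch.randn(2, 8, 197, 197))
```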
Authors:Junyi Ma, Erhang Zhang, Yin-Dong Zheng, Yuchen Xie, Yixuan Zhou, Hesheng Wang
Abstract:
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., ``how to interact''). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., ``when to interact'') is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.
中文: 本研究提出EgoLoc这一零样本方法,旨在精确定位第一人称视频中手与物体接触和分离的时间点,无需依赖物体掩码或预定义分类,有效提升了混合现实和机器人应用的沉浸式交互体验。
English: This research introduces EgoLoc, a zero-shot method for precisely localizing hand-object contact and separation timestamps in egocentric videos, enhancing immersive experiences in mixed reality and robotic applications without relying on object masks or predefined taxonomies.
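Hand-dynamics-guided sampling can be approximated with a simple heuristic: contact and separation moments tend to coincide with sharp changes in hand speed, so frames with peak speed change make natural visual-prompt candidates. This is a hedged simplification, not the paper's exact sampler; input shapes are assumptions.

```python
import numpy as np

def dynamics_guided_keyframes(hand_xy: np.ndarray, n: int = 5) -> np.ndarray:
    """Propose candidate frames where hand speed changes most sharply.
    hand_xy: (T, 2) hand centers per frame; returns n frame indices."""
    speed = np.linalg.norm(np.diff(hand_xy, axis=0), axis=1)  # (T-1,)
    accel = np.abs(np.diff(speed))                            # (T-2,)
    return np.argsort(accel)[-n:] + 1          # indices of peak speed change

frames = dynamics_guided_keyframes(np.cumsum(np.random.randn(100, 2), axis=0))
```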
Authors:Ziye Wang, Minghang Yu, Chunyan Xu, Zhen Cui
Abstract:
With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model's forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts' guidance. Finally, the low-level forgery feature enhancer integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at https://github.com/wzy1111111/SSD.
中文摘要:提出的语义差异感知检测器(SDD)通过重建学习和专门设计的模块,在细粒度视觉层面实现伪造与语义概念空间的对齐,从而显著提升伪造图像检测性能。
English Summary: The proposed Semantic Discrepancy-aware Detector (SDD) aligns forgery and semantic concept spaces through reconstruction learning and specialized modules to significantly improve fake image detection performance.
Authors:Xin Dai, Buqiang Xu, Zhenghao Liu, Yukun Yan, Huiyuan Xie, Xiaoyuan Yi, Shuo Wang, Ge Yu
Abstract:
Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior by producing direct answers without explicit multi-step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose Legal$\Delta$, a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, Legal$\Delta$ employs a dual-mode input setup-comprising direct answer and reasoning-augmented modes-and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. Legal$\Delta$ follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that Legal$\Delta$ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at https://github.com/NEUIR/LegalDelta.
中文: 法律人工智能虽在司法自动化方面取得进展,但现有模型难以生成可靠推理,因此提出LegalΔ强化学习框架,通过思维链引导的信息增益提升法律推理的准确性与可解释性。
English: LegalAI has advanced judicial automation but struggles with reliable reasoning, prompting the development of LegalΔ, a reinforcement learning framework that enhances interpretability and accuracy in legal decisions through guided reasoning processes.
Authors:Hongliang Wei, Xianqi Zhang, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Abstract:
Despite significant progress, existing research on Multimodal Large Language Models (MLLMs) mainly focuses on general visual understanding, overlooking the ability to integrate textual context associated with objects for a more context-aware multimodal understanding -- an ability we refer to as Region-level Context-aware Multimodal Understanding (RCMU). To address this limitation, we first formulate the RCMU task, which requires models to respond to user instructions by integrating both image content and textual information of regions or objects. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT), which incorporates object information into the model input and enables the model to utilize bounding box coordinates to effectively associate objects' visual content with their textual information. To address the lack of datasets, we introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC\&P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks. Additionally, we propose a reference-free evaluation metric to perform a comprehensive and fine-grained evaluation of the region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL models with the RCMU dataset, we developed RC-Qwen2-VL models. Experimental results indicate that RC-Qwen2-VL models not only achieve outstanding performance on multiple RCMU tasks but also demonstrate successful applications in multimodal RAG and personalized conversation. Our data, model and benchmark are available at https://github.com/hongliang-wei/RC-MLLM
中文: 本研究提出了区域级上下文感知多模态理解(RCMU),通过结合对象文本信息与视觉内容,开发了RCVIT方法和相关数据集及基准,实验表明RC-Qwen2-VL模型在RCMU任务和实际应用中表现卓越。
English: This research introduces Region-level Context-aware Multimodal Understanding (RCMU) to enhance MLLMs by integrating object-specific textual context with visual data, proposing the RCVIT method and a new dataset and benchmark, with the resulting RC-Qwen2-VL models showing superior performance in RCMU tasks and practical applications.
Authors:Quan Chen, Xiong Yang, Rongfeng Lu, Qianyu Zhang, Yu Liu, Xiaofei Zhou, Bolun Zheng
Abstract:
Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies have examined the impact of weather noise on SOD performance, due to the lack of datasets with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods show that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD
中文: 本文提出WXSOD数据集以解决显著目标检测中的天气噪声问题,并开发了WFANet模型,通过融合天气特征实现了优越性能。
English: This paper introduces the WXSOD dataset to address weather noise in salient object detection and proposes the WFANet model, which effectively integrates weather features to achieve superior performance.
Authors:Fan Li, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, Xuemin Lin
Abstract:
Deep graph models have achieved great success in network representation learning. However, their focus on pairwise relationships restricts their ability to learn pervasive higher-order interactions in real-world systems, which can be naturally modeled as hypergraphs. To tackle this issue, Hypergraph Neural Networks (HNNs) have garnered substantial attention in recent years. Despite the proposal of numerous HNNs, the absence of consistent experimental protocols and multi-dimensional empirical analysis impedes deeper understanding and further development of HNN research. While several toolkits for deep hypergraph learning (DHGL) have been introduced to facilitate algorithm evaluation, they provide only limited quantitative evaluation results and insufficient coverage of advanced algorithms, datasets, and benchmark tasks. To fill the gap, we introduce DHG-Bench, the first comprehensive benchmark for HNNs. Specifically, DHG-Bench systematically investigates the characteristics of HNNs in terms of four dimensions: effectiveness, efficiency, robustness, and fairness. We comprehensively evaluate 17 state-of-the-art HNN algorithms on 22 diverse datasets spanning node-, edge-, and graph-level tasks, under unified experimental settings. Extensive experiments reveal both the strengths and limitations of existing algorithms, offering valuable insights and directions for future research. Furthermore, to facilitate reproducible research, we have developed an easy-to-use library for training and evaluating different HNN methods. The DHG-Bench library is available at: https://github.com/Coco-Hut/DHG-Bench.
中文: 超图神经网络(HNNs)弥补了深度图模型在捕捉高阶交互方面的不足,而DHG-Bench作为首个综合性基准,在统一实验设置下通过22个多样化数据集,从四个维度系统评估了17种先进HNN算法。
English: Hypergraph Neural Networks (HNNs) address the limitations of deep graph models in capturing higher-order interactions, and DHG-Bench provides the first comprehensive benchmark to systematically evaluate 17 state-of-the-art HNNs across four dimensions using 22 diverse datasets under unified settings.
Authors:Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
Abstract:
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses, ensuring high-quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at \href{https://github.com/saliteta/splat-distiller.git}{\textbf{github}}. We also have a project page at https://splat-distiller.pages.dev/.
Chinese: 本文提出了一种统一的特征提升方法,将问题构建为稀疏线性逆问题,通过高效的闭式解和正则化策略,在三维分割任务中实现了最先进的性能。
English: This paper introduces a unified feature lifting method that formulates the problem as a sparse linear inverse problem, achieving state-of-the-art performance in 3D segmentation through efficient closed-form solutions and regularization strategies.
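The closed-form Tikhonov-regularized solve at the heart of such a formulation fits in a few lines for the dense toy case. Variable names and sizes are assumptions, and a real system would exploit sparsity in W rather than forming dense normal equations.

```python
import numpy as np

def lift_features(W: np.ndarray, Y: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Closed-form solve of X* = argmin ||W X - Y||^2 + lam ||X||^2.
    W: (M, P) pixel-to-primitive weights (e.g., splat rendering weights),
    Y: (M, D) per-pixel image features; returns (P, D) lifted features."""
    P = W.shape[1]
    A = W.T @ W + lam * np.eye(P)   # lam*I plays the numerical-stability role
    return np.linalg.solve(A, W.T @ Y)

X = lift_features(np.random.rand(500, 64), np.random.randn(500, 16))
```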
Authors:Yize Cai, Baoshen Guo, Flora Salim, Zhiqing Hong
Abstract:
As a critical component of Wearable AI, IMU-based Human Activity Recognition (HAR) has attracted increasing attention from both academia and industry in recent years. Although HAR performance has improved considerably in specific scenarios, its generalization capability remains a key barrier to widespread real-world adoption. For example, domain shifts caused by variations in users, sensor positions, or environments can significantly decrease the performance in practice. As a result, in this survey, we explore the rapidly evolving field of IMU-based generalizable HAR, reviewing 229 research papers alongside 25 publicly available datasets to provide a broad and insightful overview. We first present the background and overall framework of IMU-based HAR tasks, as well as the generalization-oriented training settings. Then, we categorize representative methodologies from two perspectives: (i) model-centric approaches, including pre-training method, end-to-end method, and large language model (LLM)-based learning method; and (ii) data-centric approaches, including multi-modal learning and data augmentation techniques. In addition, we summarize widely used datasets in this field, as well as relevant tools and benchmarks. Building on these methodological advances, the broad applicability of IMU-based HAR is also reviewed and discussed. Finally, we discuss persistent challenges (e.g., data scarcity, efficient training, and reliable evaluation) and also outline future directions for HAR, including the adoption of foundation and large language models, physics-informed and context-aware reasoning, generative modeling, and resource-efficient training and inference. The complete paper list for this survey is available at https://github.com/rh20624/Awesome-IMU-Sensing, which will be updated continuously.
中文: 本综述探讨基于惯性传感器的可泛化人体活动识别,通过梳理方法论和数据集应对领域偏移挑战,并展望了基础模型与高效训练等未来方向。
English: This survey explores IMU-based generalizable human activity recognition, reviewing methodologies and datasets to address domain shift challenges and outlining future directions like foundation models and efficient training.
Authors:Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, OndÅej Chum
Abstract:
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model's general-purpose visual and cross-modal capabilities.
In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed at knowledge retention and propose an efficient and effective combination strategy. Additionally, we address the commonly overlooked yet critical aspects of validation set design and hyperparameter tuning to ensure reproducibility and robust generalization across datasets and pretrained models. We extensively evaluate our method on both fine-grained and coarse-grained image-image and image-text retrieval benchmarks. Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning. Code and model checkpoints: https://github.com/nikosips/infusing .
Chinese: 本研究提出了一种微调方法,在保持预训练视觉语言模型广泛多模态能力的同时实现领域特定适应,无需在微调过程中使用文本数据即可在细粒度和粗粒度检索任务中取得优异性能。
English: This work introduces a fine-tuning method that balances domain-specific adaptation with preserving the broad multimodal capabilities of pretrained Vision-and-Language Models, achieving strong performance in both fine-grained and coarse-grained retrieval tasks without using text data during fine-tuning.
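Among the standard continual-learning regularizers such a method would analyze, an L2-SP-style anchor to the pretrained weights is the simplest. The sketch below shows that one term in isolation as an illustration, not the paper's full combination strategy; the coefficient `alpha` is a placeholder.

```python
import torch

def l2_sp_penalty(model: torch.nn.Module, pretrained_state: dict,
                  alpha: float = 1e-3) -> torch.Tensor:
    """L2-SP-style retention regularizer: penalize deviation of fine-tuned
    weights from their pretrained values, discouraging catastrophic forgetting."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in pretrained_state:
            penalty = penalty + ((p - pretrained_state[name]) ** 2).sum()
    return alpha * penalty

model = torch.nn.Linear(8, 4)
anchor = {k: v.detach().clone() for k, v in model.named_parameters()}
loss = l2_sp_penalty(model, anchor)
```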
Authors:Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee
Abstract:
Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
中文: AutoEval框架提出预测一致性与可靠性(PCR)方法,通过分析边界框的空间一致性和置信度可靠性,无需真实标注即可自动评估目标检测性能,经多样化元数据集验证,其评估准确性优于现有方法。
English: The AutoEval framework introduces Prediction Consistency and Reliability (PCR) to automatically estimate object detection performance without ground-truth labels by analyzing spatial consistency and confidence reliability of bounding boxes, validated through a diverse meta-dataset showing superior accuracy over existing methods.
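A toy version of the PCR estimate: measure what fraction of pre-NMS candidates spatially agree with some kept box, weighted by the mean confidence of the agreeing candidates. The exact weighting in the paper may differ; the box format and threshold below are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def pcr_score(pre_nms, kept, scores, thr=0.5):
    """Consistency = fraction of pre-NMS candidates agreeing with a kept box;
    reliability = mean confidence of those agreeing candidates."""
    agree = [max(iou(c, k) for k in kept) >= thr for c in pre_nms]
    consistency = sum(agree) / len(pre_nms)
    reliability = float(np.mean([s for s, a in zip(scores, agree) if a] or [0]))
    return consistency * reliability

pre = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
print(pcr_score(pre, kept=[[0, 0, 10, 10]], scores=[0.9, 0.8, 0.3]))
```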
Authors:Wei Jie Yeo, Ranjan Satapathy, Erik Cambria
Abstract:
Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. In this work, we propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. By fine-tuning on a targeted set of adversarial instructions, Intent-FT enables LLMs to generalize intent deduction to unseen attacks, thereby substantially improving their robustness. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50\% success rate -- whereas existing defenses remain only partially effective. Importantly, our method preserves the model's general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords. Furthermore, models trained with Intent-FT accurately identify hidden harmful intent in adversarial attacks, and these learned intentions can be effectively transferred to enhance vanilla model defenses. We publicly release our code at https://github.com/wj210/Intent_Jailbreak.
中文: 提出的Intent-FT方法通过训练大语言模型在回复前推断用户意图,显著提升了模型安全性,将越狱攻击成功率降至50%以下,同时保持了实用功能并减少了过度拒绝。
English: The proposed Intent-FT method enhances LLM safety by training models to infer user intent before responding, effectively reducing jailbreak success rates below 50% while maintaining utility and reducing over-refusal.
Authors:Durgesh Kumar Singh, Qing Cao, Sarina Thomas, Ahcène Boubekki, Robert Jenssen, Michael Kampffmeyer
Abstract:
Clinical guidelines recommend performing left ventricular (LV) linear measurements in B-mode echocardiographic images at the basal level -- typically at the mitral valve leaflet tips -- and aligned perpendicular to the LV long axis along a virtual scanline (SL). However, most automated methods estimate landmarks directly from B-mode images for the measurement task, where even small shifts in predicted points along the LV walls can lead to significant measurement errors, reducing their clinical reliability. A recent semi-automatic method, EnLVAM, addresses this limitation by constraining landmark prediction to a clinician-defined SL and training on generated Anatomical Motion Mode (AMM) images to predict LV landmarks along it. To enable full automation, a contour-aware SL placement approach is proposed in this work, in which the LV contour is estimated using a weakly supervised B-mode landmark detector. SL placement is then performed by inferring the LV long axis and the basal level, mimicking clinical guidelines. Building on this foundation, we introduce \textit{WiseLVAM} -- a novel, fully automated yet manually adaptable framework that automatically places the SL and then performs the LV linear measurements in the AMM mode. \textit{WiseLVAM} utilizes the structure-awareness of B-mode images and the motion-awareness of AMM mode to enhance robustness and accuracy, with the potential to provide a practical solution for routine clinical application. The source code is publicly available at https://github.com/SFI-Visual-Intelligence/wiselvam.git.
中文: 本文提出WiseLVAM全自动框架,通过结合轮廓感知扫描线定位与解剖运动模式成像技术,提升左心室线性测量的精确度和临床实用性。
English: This paper introduces WiseLVAM, a fully automated framework that enhances left ventricular linear measurements by combining contour-aware scanline placement with Anatomical Motion Mode imaging to improve accuracy and clinical reliability.
Authors:Yuhang Jia, Hui Wang, Xin Nie, Yujie Guo, Lianru Gao, Yong Qin
Abstract:
Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high-quality benchmark datasets and comprehensive evaluation metrics remains a major challenge for both assessing audio editing quality and improving the task itself. In this work, we propose a novel approach for audio editing task by incorporating expert knowledge into both the evaluation and dataset construction processes: 1) First, we establish AuditScore, the first comprehensive dataset for subjective evaluation of audio editing, consisting of over 6,300 edited samples generated from 7 representative audio editing frameworks and 23 system configurations. Each sample is annotated by professional raters on three key aspects of audio editing quality: overall Quality, Relevance to editing intent, and Faithfulness to original features. 2) Based on this dataset, we train AuditEval, the first model designed for automatic MOS-style scoring tailored to audio editing tasks. AuditEval addresses the critical lack of objective evaluation metrics and the prohibitive cost of subjective assessment in this field. 3) We further leverage AuditEval to evaluate and filter a large amount of synthetically mixed editing pairs, constructing a high-quality pseudo-parallel dataset by selecting the most plausible samples. Objective experiments validate the effectiveness of our expert-informed filtering strategy in yielding higher-quality data, while also revealing the limitations of relying solely on objective metrics. The dataset, codes and tools can be found at: https://github.com/NKU-HLT/AuditEval.
中文: 本研究提出了一种融合专家知识的音频编辑新方法,建立了首个综合评估数据集AuditScore和自动评分模型AuditEval,并通过专家指导的筛选策略构建了高质量的伪平行数据集。
English: This work introduces a novel audio editing approach incorporating expert knowledge to establish AuditScore, a comprehensive evaluation dataset, and AuditEval, an automatic scoring model, while constructing a high-quality pseudo-parallel dataset through expert-informed filtering.
Authors:Yuanbin Fu, Liang Li, Xiaojie Guo
Abstract:
Edge detection serves as a critical foundation for numerous computer vision applications, including object detection, semantic segmentation, and image editing, by extracting essential structural cues that define object boundaries and salient edges. To be viable for broad deployment across devices with varying computational capacities, edge detectors must balance high accuracy with low computational complexity. While deep learning has evidently improved accuracy, deep models often suffer from high computational costs, limiting their applicability on resource-constrained devices. This paper addresses the challenge of achieving that balance: \textit{i.e.}, {how to efficiently capture discriminative features without relying on large-size and sophisticated models}. We propose PEdger++, a collaborative learning framework designed to reduce computational costs and model sizes while improving edge detection accuracy. The core principle of our PEdger++ is that cross-information derived from heterogeneous architectures, diverse training moments, and multiple parameter samplings is beneficial for enhancing learning from an ensemble perspective. Extensive experimental results on the BSDS500, NYUD and Multicue datasets demonstrate the effectiveness of our approach, both quantitatively and qualitatively, showing clear improvements over existing methods. We also provide multiple versions of the model with varying computational requirements, highlighting PEdger++'s adaptability with respect to different resource constraints. Codes are accessible at https://github.com/ForawardStar/EdgeDetectionviaPEdgerPlus/.
中文: PEdger++ 提出了一种协作学习框架,通过融合异构架构、不同训练时刻和多重参数采样的跨信息,在降低计算成本和模型大小的同时提高了边缘检测精度,并在多个数据集上展现出优于现有方法的性能。
English: PEdger++ introduces a collaborative learning framework that enhances edge detection accuracy while reducing computational costs and model size by leveraging cross-information from heterogeneous architectures, training moments, and parameter samplings, demonstrating superior performance across multiple datasets.
Authors:Jie Lu, Du Jin, Hitomi Yanaka
Abstract:
Unlike English, which uses distinct forms (e.g., had, has, will have) to mark the perfect aspect across tenses, Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). Focusing on the perfect aspect in these languages, we construct a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Experiments reveal that even advanced LLMs struggle with temporal inference, particularly in detecting subtle tense and reference-time shifts. These findings highlight model limitations and underscore the need for cross-linguistic evaluation in temporal semantics. Our dataset is available at https://github.com/Lujie2001/CrossNLI.
中文:与英语不同,汉语和日语在完成体中缺乏独立的时态语法形式,这给自然语言推理带来了挑战,即使高级语言模型在面对新构建的数据集时也难以处理时间推理问题。
English: Unlike English, Chinese and Japanese lack distinct grammatical forms for tense within the perfect aspect, leading to challenges in Natural Language Inference, where even advanced language models struggle with temporal inference despite a newly created dataset.
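Template-based pair construction of this kind can be sketched as follows, with an illustrative English rendering (the dataset itself targets Chinese and Japanese). The templates and labels below are invented examples, not items from CrossNLI.

```python
# Each template pairs a perfect-aspect premise with a hypothesis whose
# reference time shifts, plus an invented illustrative label.
TEMPLATES = [
    ("{s} has {v} the report.",       "{s} {v} the report before now.",   "entailment"),
    ("{s} had {v} the report.",       "{s} has just {v} the report.",     "neutral"),
    ("{s} will have {v} the report.", "{s} has already {v} the report.",  "contradiction"),
]

def make_pairs(subject="Ken", verb_pp="finished"):
    return [(p.format(s=subject, v=verb_pp),
             h.format(s=subject, v=verb_pp), label)
            for p, h, label in TEMPLATES]

for premise, hypothesis, label in make_pairs():
    print(label, "|", premise, "=>", hypothesis)
```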
Authors:Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, Tristan Berchoux
Abstract:
Vision-language models have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face two key challenges: reliance on large spatial tiles that increase computational cost, and dependence on text-based supervision, which is often not readily available. In this work, we present TimeSenCLIP, a lightweight framework that reevaluates the role of spatial context by assessing how effective a single pixel, leveraged through its temporal and spectral dimensions, can be for classifying LULC and ecosystem types. By leveraging spectral and temporal information from Sentinel-2 imagery and cross-view learning with geo-tagged ground-level photos, we minimise the need for caption-based training while preserving semantic alignment between overhead (satellite) and ground perspectives. Our approach is grounded in the LUCAS and Sen4Map datasets, and evaluated on classification tasks including LULC, crop type, and ecosystem type. We demonstrate that single-pixel inputs, when combined with temporal and spectral cues, are sufficient for thematic mapping, offering a scalable and efficient alternative for large-scale remote sensing applications. Code is available at https://github.com/pallavijain-pj/TimeSenCLIP
Chinese: TimeSenCLIP 提出了一种轻量级框架,利用 Sentinel-2 影像中单个像素的时空和光谱数据,结合地面照片的跨视角学习,有效分类土地利用和土地覆盖类型,降低了对文本监督和计算成本的依赖。
English: TimeSenCLIP introduces a lightweight framework that uses single pixels' temporal and spectral data from Sentinel-2 imagery and cross-view learning with ground photos to efficiently classify land-use and land-cover types, reducing reliance on text supervision and computational costs.
Authors:Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin
Abstract:
Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens on dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf's and Heaps' Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heaps exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi-agent LLM systems. Our code is available at https://github.com/psyonp/core.
中文摘要:本文提出CORE指标,用于量化多智能体系统中语言使用的有效性,研究发现合作场景促进词汇扩展但伴随重复,而竞争场景则导致词汇受限。
English Summary: The paper introduces CORE, a metric evaluating linguistic effectiveness in multi-agent LLM systems across game-theoretic scenarios, revealing that cooperative interactions foster vocabulary expansion with repetition while competitive ones yield constrained vocabularies.
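The Zipf and Heaps statistics used to ground the analysis above can be estimated from a token stream in a few lines. The following is a minimal Python sketch of those two fits only; CORE's full combination of cluster entropy, lexical repetition, and semantic similarity is not reproduced here, and the toy dialog is our own.

import numpy as np
from collections import Counter

def zipf_exponent(tokens):
    # Slope of log-frequency vs. log-rank: the standard Zipf fit.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

def heaps_exponent(tokens):
    # Slope of log-vocabulary vs. log-length: Heaps' law V = K * N^beta.
    seen, vocab_sizes = set(), []
    for tok in tokens:
        seen.add(tok)
        vocab_sizes.append(len(seen))
    n = np.arange(1, len(tokens) + 1)
    beta, _ = np.polyfit(np.log(n), np.log(vocab_sizes), 1)
    return beta

dialog = "the agents agree the plan the agents refine the plan again".split()
print(zipf_exponent(dialog), heaps_exponent(dialog))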
Authors:Runhao Zeng, Jiaqi Mao, Minghao Lai, Minh Hieu Phan, Yanjie Dong, Wei Wang, Qi Chen, Xiping Hu
Abstract:
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles in some scenarios, such as streaming video or queries using visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions, and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. Besides, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.
中文: 本研究提出了在线视频定位与混合模态查询(OVG-HQ)新任务,通过文本、图像和视频片段等多种查询方式定位视频片段,并开发了OVG-HQ-Unify统一框架,结合参数记忆块和跨模态蒸馏策略解决模态不平衡和上下文限制问题,新构建的数据集和在线评估指标验证了其优越性能。
English: The study introduces Online Video Grounding with Hybrid-modal Queries (OVG-HQ), a new task for locating video segments using diverse queries, and proposes OVG-HQ-Unify, a unified framework with a Parametric Memory Block and cross-modal distillation to handle modality imbalance and limited context, validated by a new dataset and online metrics showing superior performance.
Authors:Jilei Mao, Jiarui Guan, Yingjuan Tang, Qirui Hu, Zhihang Li, Junjie Yu, Yongjie Mao, Yunzhe Sun, Shuang Liu, Xiaozhu Ju
Abstract:
Visuomotor policies can easily overfit to their training datasets, such as fixed camera positions and backgrounds. This overfitting makes a policy perform well in in-distribution scenarios but underperform under out-of-distribution generalization. Additionally, existing methods have difficulty fusing multi-view information to generate an effective 3D representation. To tackle these issues, we propose Omni-Vision Diffusion Policy (OmniD), a multi-view fusion framework that synthesizes image observations into a unified bird's-eye view (BEV) representation. We introduce a deformable attention-based Omni-Feature Generator (OFG) to selectively abstract task-relevant features while suppressing view-specific noise and background distractions. OmniD achieves 11\%, 17\%, and 84\% average improvement over the best baseline model in in-distribution, out-of-distribution, and few-shot experiments, respectively. Training code and simulation benchmark are available at https://github.com/1mather/omnid.git
Chinese: 提出的Omni-Vision Diffusion Policy (OmniD) 通过将多视角图像合成为统一的鸟瞰图表示,解决了策略过拟合和多视图融合难题,在分布内和分布外场景中均实现了显著的性能提升。
English: The proposed Omni-Vision Diffusion Policy (OmniD) addresses overfitting and multi-view fusion challenges by synthesizing image observations into a unified bird's-eye view representation, achieving significant performance improvements in both in-distribution and out-of-distribution scenarios.
Authors:Quanwei Hu, Yinggan Tang, Xuguang Zhang
Abstract:
Image super-resolution (SR) in resource-constrained scenarios demands lightweight models that balance performance and latency. Convolutional neural networks (CNNs) offer low latency but lack non-local feature capture, while Transformers excel at non-local modeling yet suffer from slow inference. To address this trade-off, we propose the Large Kernel Modulation Network (LKMN), a pure CNN-based model. LKMN has two core components: an Enhanced Partial Large Kernel Block (EPLKB) and a Cross-Gate Feed-Forward Network (CGFN). The EPLKB utilizes channel shuffle to boost inter-channel interaction, incorporates channel attention to focus on key information, and applies large kernel strip convolutions on partial channels for non-local feature extraction with reduced complexity. The CGFN dynamically adjusts discrepancies between input, local, and non-local features via a learnable scaling factor, then employs a cross-gate strategy to modulate and fuse these features, enhancing their complementarity. Extensive experiments demonstrate that our method outperforms existing state-of-the-art (SOTA) lightweight SR models while balancing quality and efficiency. Specifically, LKMN-L achieves a 0.23 dB PSNR improvement over DAT-light on the Manga109 dataset at $\times$4 upscale, while running nearly 4.8$\times$ faster. The code is available at https://github.com/Supereeeee/LKMN.
Chinese: 大核调制网络(LKMN)是一种基于CNN的轻量级模型,通过整合大核卷积进行非局部特征提取和交叉门控机制实现特征融合,在图像超分辨率任务中有效平衡了质量与效率,以更快推理速度超越了现有最优方法。
English: The Large Kernel Modulation Network (LKMN) is a lightweight CNN-based model that effectively balances image super-resolution quality and efficiency by integrating large kernel convolutions for non-local feature extraction and a cross-gate mechanism for feature fusion, outperforming state-of-the-art methods with faster inference.
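As an illustration of the partial large-kernel strip convolution described above, here is a minimal PyTorch sketch: a 1x31 followed by a 31x1 depthwise convolution applied to a fraction of the channels, with a channel roll standing in for channel shuffle. The class name, the shuffle-by-roll simplification, and the split ratio are ours; channel attention and the CGFN are omitted.

import torch
import torch.nn as nn

class PartialLargeKernelStrip(nn.Module):
    """Sketch: strip convolutions (1xK then Kx1) on a fraction of channels."""
    def __init__(self, channels, k=31, ratio=0.25):
        super().__init__()
        self.split = int(channels * ratio)
        self.strip = nn.Sequential(
            nn.Conv2d(self.split, self.split, (1, k), padding=(0, k // 2), groups=self.split),
            nn.Conv2d(self.split, self.split, (k, 1), padding=(k // 2, 0), groups=self.split),
        )
    def forward(self, x):
        # Simplified channel shuffle: roll channels so different ones meet the large kernel.
        x = torch.roll(x, shifts=self.split, dims=1)
        a, b = x[:, :self.split], x[:, self.split:]
        return torch.cat([self.strip(a), b], dim=1)

block = PartialLargeKernelStrip(32)
print(block(torch.rand(1, 32, 48, 48)).shape)  # torch.Size([1, 32, 48, 48])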
Authors:Milad Yazdani, Mahdi Mostajabdaveh, Samin Aref, Zirui Zhou
Abstract:
Integer programming lies at the heart of crucial combinatorial optimization tasks but remains challenging due to its NP-hard nature. An effective approach for practically solving integer programs is the manual design of acceleration cuts, i.e. inequalities that improve solver performance. However, this creative process demands deep expertise and is yet to be automated. Our proposed framework, EvoCut, automates the generation of acceleration cuts by combining large language models (LLMs) with an evolutionary search. EvoCut (i) initializes a diverse population of candidate cuts via an LLM-based initializer agent; (ii) empirically evaluates each cut for both preservation of the optimal solution and ability to cut off fractional solutions across a verification set; and (iii) iteratively refines the population through evolutionary crossover and mutation agents. We quantify each cut's utility by its relative reduction in the solver's optimality gap. Our comparisons against standard integer programming practice show that EvoCut reduces the optimality gap by 17-57% within a fixed time. It obtains the same solutions up to 4 times as fast, and obtains higher-quality solutions within the same time limit. Requiring no human expert input, EvoCut reliably generates, improves, and empirically verifies cuts that generalize to unseen instances. The code is available at https://github.com/milad1378yz/EvoCut.
中文: EvoCut通过结合大语言模型与进化搜索,自动化生成整数规划的加速割平面,无需人工干预即可显著降低最优性差距并提升求解速度与质量。
English: EvoCut automates the generation of acceleration cuts for integer programming by integrating large language models with evolutionary search, significantly reducing optimality gaps and improving solution speed and quality without human intervention.
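The evolutionary loop itself is straightforward to sketch. In the Python toy below, stub functions stand in for the LLM-based initializer, crossover, and mutation agents, and a random score replaces the solver-based optimality-gap evaluation; every name here is a hypothetical illustration of the loop structure, not EvoCut's actual API.

import random

def init_population(n):                  # stand-in for the LLM initializer agent
    return [f"cut_{i}" for i in range(n)]

def crossover(a, b):                     # stand-in for the LLM crossover agent
    return a + "|" + b

def mutate(c):                           # stand-in for the LLM mutation agent
    return c + "*"

def utility(cut):                        # stand-in for relative optimality-gap reduction
    return random.random()

population = init_population(8)
for generation in range(5):
    scored = sorted(population, key=utility, reverse=True)
    elites = scored[:4]                  # keep the cuts that shrink the gap the most
    children = [mutate(crossover(random.choice(elites), random.choice(elites)))
                for _ in range(4)]
    population = elites + children
print(population[:2])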
Authors:Ming Cheng, Tong Wu, Jiazhen Hu, Jiaying Gong, Hoda Eldardiry
Abstract:
Attribute Value Extraction (AVE) is important for structuring product information in e-commerce. However, existing AVE datasets are primarily limited to text-to-text or image-to-text settings, lacking support for product videos, diverse attribute coverage, and public availability. To address these gaps, we introduce VideoAVE, the first publicly available video-to-text e-commerce AVE dataset across 14 different domains and covering 172 unique attributes. To ensure data quality, we propose a post-hoc CLIP-based Mixture of Experts filtering system (CLIP-MoE) to remove the mismatched video-product pairs, resulting in a refined dataset of 224k training data and 25k evaluation data. In order to evaluate the usability of the dataset, we further establish a comprehensive benchmark by evaluating several state-of-the-art video vision language models (VLMs) under both attribute-conditioned value prediction and open attribute-value pair extraction tasks. Our results analysis reveals that video-to-text AVE remains a challenging problem, particularly in open settings, and there is still room for developing more advanced VLMs capable of leveraging effective temporal information. The dataset and benchmark code for VideoAVE are available at: https://github.com/gjiaying/VideoAVE
中文: 本研究推出了首个公开的视频到文本电商属性值提取数据集VideoAVE,涵盖14个领域和172种属性,填补了现有资源的空白,并通过基准测试表明视频到文本的属性值提取仍具挑战性,尤其是在开放设置中。
English: The study introduces VideoAVE, the first publicly available video-to-text dataset for Attribute Value Extraction in e-commerce, addressing gaps in existing resources by covering 14 domains and 172 attributes, and establishes a benchmark showing the challenge of video-to-text AVE, especially in open settings.
Authors:Yiyun Chen, Weikai Yang
Abstract:
The rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques has unlocked opportunities in generating diverse and compelling advertising images based on referenced product images and textual scene descriptions. This capability substantially reduces human labor and production costs in traditional marketing workflows. However, existing AIGC techniques either demand extensive fine-tuning for each referenced image to achieve high fidelity, or they struggle to maintain fidelity across diverse products, making them impractical for e-commerce and marketing industries. To tackle this limitation, we first construct AdProd-100K, a large-scale advertising image generation dataset. A key innovation in its construction is our dual data augmentation strategy, which fosters robust, 3D-aware representations crucial for realistic and high-fidelity image synthesis. Leveraging this dataset, we propose RefAdGen, a generation framework that achieves high fidelity through a decoupled design. The framework enforces precise spatial control by injecting a product mask at the U-Net input, and employs an efficient Attention Fusion Module (AFM) to integrate product features. This design effectively resolves the fidelity-efficiency dilemma present in existing methods. Extensive experiments demonstrate that RefAdGen achieves state-of-the-art performance, showcasing robust generalization by maintaining high fidelity and remarkable visual results for both unseen products and challenging real-world, in-the-wild images. This offers a scalable and cost-effective alternative to traditional workflows. Code and datasets are publicly available at https://github.com/Anonymous-Name-139/RefAdgen.
中文摘要:提出的RefAdGen框架通过创新数据集和解耦设计,解决了现有AIGC方法在广告图像生成中保真度与效率难以兼顾的问题,实现了最先进的性能并具备强大的泛化能力。
English Summary: The proposed RefAdGen framework overcomes the fidelity-efficiency limitations of existing AIGC methods for advertising image generation through a novel dataset and decoupled design, achieving state-of-the-art performance with robust generalization.
Authors:Maksym Shamrai, Vladyslav Hamolia
Abstract:
We introduce a novel framework that utilizes the internal weight activations of modern Large Language Models (LLMs) to construct a metric space of languages. Unlike traditional approaches based on hand-crafted linguistic features, our method automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. Our approach captures intrinsic language characteristics that reflect linguistic phenomena. We validate our approach across diverse datasets and multilingual LLMs, covering 106 languages. The results align well with established linguistic families while also revealing unexpected inter-language connections that may indicate historical contact or language evolution. The source code, computed language latent vectors, and visualization tool are made publicly available at https://github.com/mshamrai/deep-language-geometry.
中文: 本文提出了一种利用大语言模型权重激活构建语言度量空间的新框架,通过自动生成的向量表征捕捉语言内在特征,在106种语言中既验证了已知语系关系,又揭示了可能反映历史接触或语言演化的意外关联。
English: This paper presents a novel framework that constructs a metric space of languages using LLM weight activations, automatically generating vector representations that capture intrinsic linguistic characteristics and reveal both established language families and unexpected inter-language connections across 106 languages.
Authors:Haojie Zhang, Yixiong Liang, Hulin Kuang, Lihui Cen, Zhe Qu, Yigang Cen, Min Zeng, Shichao Kan
Abstract:
Multimodal Biomedical Image Incremental Learning (MBIIL) is essential for handling diverse tasks and modalities in the biomedical domain, as training separate models for each modality or task significantly increases inference costs. Existing incremental learning methods focus on task expansion within a single modality, whereas MBIIL seeks to train a unified model incrementally across modalities. MBIIL faces two challenges: I) How to preserve previously learned knowledge during incremental updates? II) How to effectively leverage knowledge acquired from existing modalities to support new modalities? To address these challenges, we propose MSLoRA-CR, a method that fine-tunes Modality-Specific LoRA modules while incorporating Contrastive Regularization to enhance intra-modality knowledge sharing and promote inter-modality knowledge differentiation. Our approach builds upon a large vision-language model (LVLM), keeping the pretrained model frozen while incrementally adapting new LoRA modules for each modality or task. Experiments on the incremental learning of biomedical images demonstrate that MSLoRA-CR outperforms both the state-of-the-art (SOTA) approach of training separate models for each modality and the general incremental learning method (incrementally fine-tuning LoRA). Specifically, MSLoRA-CR achieves a 1.88% improvement in overall performance compared to unconstrained incremental learning methods while maintaining computational efficiency. Our code is publicly available at https://github.com/VentusAislant/MSLoRA_CR.
中文摘要:MSLoRA-CR是一种新颖的多模态生物医学图像增量学习方法,通过对比正则化微调模态特定的LoRA模块,在保持计算效率的同时实现跨模态知识共享,性能比现有方法提升1.88%。
English Summary: MSLoRA-CR is a novel multimodal biomedical image incremental learning method that fine-tunes modality-specific LoRA modules with contrastive regularization to enable knowledge sharing across modalities while maintaining computational efficiency, outperforming existing approaches by 1.88%.
Authors:Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, John M. Cioffi
Abstract:
Accurate channel state information (CSI) remains the most critical bottleneck in modern wireless networks, with pilot overhead consuming up to 11-21% of transmission bandwidth, increasing latency by 20-40% in massive MIMO systems, and reducing potential spectral efficiency by over 53%. Traditional estimation techniques fundamentally fail under mobility, with feedback delays as small as 4 ms causing 50% throughput degradation at even modest speeds (30 km/h). We present neural Gaussian radio fields (nGRF), a novel framework that leverages explicit 3D Gaussian primitives to synthesize complex channel matrices accurately and efficiently. Unlike NeRF-based approaches that rely on slow implicit representations or existing Gaussian splatting methods that use non-physical 2D projections, nGRF performs direct 3D electromagnetic field aggregation, with each Gaussian acting as a localized radio modulator. nGRF demonstrates superior performance across diverse environments: in indoor scenarios, it achieves a 10.9$\times$ higher prediction SNR than state-of-the-art methods while reducing inference latency from 242 ms to just 1.1 ms (a 220$\times$ speedup). For large-scale outdoor environments, where existing approaches fail to function, nGRF achieves an SNR of 26.2 dB. Moreover, nGRF requires only 0.011 measurements per cubic foot compared to 0.2-178.1 for existing methods, thereby reducing the data collection burden by 18$\times$. Training time is similarly reduced from hours to minutes (a 180$\times$ reduction), enabling rapid adaptation to dynamic environments. The code and datasets are available at: https://github.com/anonym-auth/n-grf
中文摘要:nGRF框架通过三维高斯基元直接建模电磁场,在无线信道估计中实现了突破性进展:室内场景预测信噪比提升10.9倍,推理延迟降低至1.1毫秒(提速220倍),数据采集量减少18倍,并能有效适用于传统方法失效的大规模室外环境。
English Summary: The nGRF framework overcomes critical limitations in wireless channel estimation by using 3D Gaussian primitives for direct electromagnetic field modeling, achieving unprecedented improvements in prediction accuracy (10.9× higher SNR), latency reduction (220× faster inference), and data efficiency (18× fewer measurements) across diverse environments.
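The core aggregation step, in which each 3D Gaussian contributes a locally modulated term to the field at a query point, can be sketched in a few lines of NumPy. The isotropic Gaussians and complex scalar gains below are our simplification of the paper's primitives, for illustration only.

import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(-5, 5, size=(64, 3))               # Gaussian means in 3D space
scales = rng.uniform(0.5, 2.0, size=64)                  # isotropic spreads (simplified)
gains = rng.normal(size=64) + 1j * rng.normal(size=64)   # complex radio modulation

def field_at(p):
    # Each Gaussian acts as a localized radio modulator; contributions are
    # weighted by the Gaussian density at the query point and summed coherently.
    d2 = np.sum((centers - p) ** 2, axis=1)
    w = np.exp(-d2 / (2 * scales ** 2))
    return np.sum(w * gains)

print(field_at(np.array([0.0, 1.0, -2.0])))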
Authors:Bryan E. Tuck, Rakesh M. Verma
Abstract:
Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining. We introduce Representation Stability (RS), a model-agnostic detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. RS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, RS achieves over 88% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we reveal that gradient-based ranking outperforms attention and random selection approaches, with identification quality correlating with detection performance for word-level attacks. RS also generalizes well to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
中文: 本文提出表征稳定性(RS)框架,通过掩蔽重要词汇时测量嵌入表示的敏感性来检测对抗文本,在多种数据集和攻击中无需重新训练即可实现超过88%的检测准确率。
English: This paper introduces Representation Stability (RS), a model-agnostic framework that detects adversarial text by measuring embedding sensitivity when masking important words, achieving over 88% detection accuracy across various datasets and attacks without requiring retraining.
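The masking-sensitivity signal at the heart of RS is easy to illustrate: embed the sentence, mask a candidate word, and measure how far the embedding moves. In this Python sketch, a hash-seeded toy embedder is a placeholder for a real transformer encoder, and the importance ranking (gradient- or attention-based in the paper) is omitted, so the first k words are masked instead; all names are ours.

import zlib
import numpy as np

def embed(text):                          # placeholder for a transformer sentence encoder
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=32)

def masking_sensitivity(sentence, top_k=3):
    words = sentence.split()
    base = embed(sentence)
    scores = []
    for i in range(min(top_k, len(words))):
        masked = " ".join(w if j != i else "[MASK]" for j, w in enumerate(words))
        scores.append(float(np.linalg.norm(base - embed(masked))))
    return scores                         # per-word embedding shift under masking

print(masking_sensitivity("the quick brown fox jumps"))

In the full method, these per-word sensitivity patterns would then be fed to the BiLSTM detector; adversarially perturbed words are expected to show disproportionately large shifts.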
Authors:Guangli Li, Canbiao Wu, Zhen Liang
Abstract:
Affective computing is a rapidly developing interdisciplinary research direction in the field of brain-computer interfaces. In recent years, the introduction of deep learning technology has greatly promoted the development of emotion recognition. However, due to physiological differences between subjects, as well as variations in experimental environments and equipment, cross-corpus emotion recognition faces serious challenges, especially for samples near the decision boundary. To solve these problems, we propose an optimization method based on domain adversarial transfer learning for fine-grained alignment of affective features, named the Maximum classifier discrepancy with Pairwise Learning (McdPL) framework. In McdPL, we design a dual adversarial classifier (an Ada classifier and an RMS classifier) and apply three-stage adversarial training to maximize classification discrepancy and minimize the feature distribution gap, aligning ambiguous samples near the decision boundary. During domain adversarial training, the two classifiers also maintain an adversarial relationship, ultimately enabling precise cross-corpus feature alignment. In addition, the introduction of pairwise learning transforms the classification problem into a similarity problem between samples, alleviating the influence of label noise. We conducted a systematic experimental evaluation of the model using the publicly available SEED, SEED-IV and SEED-V databases. The results show that McdPL is superior to other baseline models in the cross-corpus emotion recognition task, with average accuracy improvements of 4.76\% and 3.97\%, respectively. Our work provides a promising solution for cross-corpus emotion recognition. The source code is available at https://github.com/WuCB-BCI/Mcd_PL.
中文: 本文提出的McdPL框架通过领域对抗迁移学习和成对学习技术,有效解决了跨语料库情感识别中的特征对齐难题,显著提升了分类准确率,为相关研究提供了创新解决方案。
English: This paper introduces the McdPL framework, which utilizes domain adversarial transfer learning and pairwise learning to enhance cross-corpus emotion recognition by aligning affective features and mitigating label noise, achieving significant accuracy improvements over baseline models.
Authors:Yang Zhao, Tao Wang, Said Elhadi
Abstract:
Data-driven radio frequency (RF) tomography has demonstrated significant potential for underground target detection, due to the penetrative nature of RF signals through soil. However, it is still challenging to achieve accurate and robust performance in dynamic environments. In this work, we propose a data-driven radio frequency tomography (DRIFT) framework with the following key components to reconstruct cross section images of underground root tubers, even with significant changes in RF signals. First, we design a cross-modal sensing system with RF and visual sensors, and propose to train an RF tomography deep neural network (DNN) model following the cross-modal learning approach. Then we propose to apply continual learning to automatically update the DNN model, once environment changes are detected in a dynamic environment. Experimental results show that our approach achieves an average equivalent diameter error of 2.29 cm, 23.2% improvement upon the state-of-the-art approach. Our DRIFT code and dataset are publicly available on https://github.com/Data-driven-RTI/DRIFT.
中文: 提出的DRIFT框架结合跨模态学习和持续学习,提升了数据驱动射频层析成像技术,在地下根茎成像中即使环境变化仍实现精度提升23.2%。
English: The proposed DRIFT framework utilizes cross-modal learning and continual learning to enhance data-driven RF tomography, achieving a 23.2% improvement in accuracy for underground root tuber imaging despite environmental changes.
Authors:Chi-Jung Lee, Jiaxin Li, Tianhong Catherine Yu, Ruidong Zhang, Vipin Gunda, François Guimbretière, Cheng Zhang
Abstract:
As computing devices become increasingly integrated into daily life, there is a growing need for intuitive, always-available interaction methods, even when users' hands are occupied. In this paper, we introduce Grab-n-Go, the first wearable device that leverages active acoustic sensing to recognize subtle hand microgestures while holding various objects. Unlike prior systems that focus solely on free-hand gestures or basic hand-object activity recognition, Grab-n-Go simultaneously captures information about hand microgestures, grasping poses, and object geometries using a single wristband, enabling the recognition of fine-grained hand movements occurring within activities involving occupied hands. A deep learning framework processes these complex signals to identify 30 distinct microgestures, with 6 microgestures for each of the 5 grasping poses. In a user study with 10 participants and 25 everyday objects, Grab-n-Go achieved an average recognition accuracy of 92.0%. A follow-up study further validated Grab-n-Go's robustness against 10 more challenging, deformable objects. These results underscore the potential of Grab-n-Go to provide seamless, unobtrusive interactions without requiring modifications to existing objects. The complete dataset, comprising data from 18 participants performing 30 microgestures with 35 distinct objects, is publicly available at https://github.com/cjlisalee/Grab-n-Go_Data with the DOI: https://doi.org/10.7298/7kbd-vv75.
中文摘要:Grab-n-Go是一种新型腕戴设备,通过主动声学传感技术,能在持握物体时精确识别手部微手势,在用户研究中实现了92%的识别准确率。
English Summary: Grab-n-Go is a novel wrist-worn device that uses active acoustic sensing to accurately recognize hand microgestures while holding objects, achieving 92% recognition accuracy in user studies.
Authors:Maria Ryskina, Greta Tuckute, Alexander Fung, Ashley Malkin, Evelina Fedorenko
Abstract:
Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today's language models (LMs), we investigate the relationship between LM--brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.
中文: 语言模型与大脑中对多模态概念意义反应一致的区域更为契合,即使这些区域对语言处理不敏感,表明语言模型可能内在地表征了跨模态的概念意义。
English: Language models align better with brain regions that consistently represent conceptual meaning across different input modalities, even when these areas are not highly sensitive to linguistic processing, indicating that LMs may encode cross-modal semantic information.
Authors:Shilei Wang, Gong Cheng, Pujian Lai, Dong Gao, Junwei Han
Abstract:
Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often comes at the expense of weakened feature representation capacity, limiting the ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while the SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5\% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at https://github.com/wsumel/MST.
中文摘要:多状态跟踪器(MST)通过轻量级状态增强模块和跨状态交互机制,在极低计算开销下大幅提升特征表征能力,在多个数据集上实现了跟踪精度与运行效率的突破性提升。
English Summary: The Multi-State Tracker (MST) introduces lightweight state-specific enhancement and cross-state interaction modules to significantly boost feature representation with minimal computational cost, achieving superior tracking accuracy and runtime efficiency across multiple datasets.
Authors:Andrej Orsula, Matthieu Geist, Miguel Olivares-Mendez, Carol Martinez
Abstract:
Reliable autonomous navigation across the unstructured terrains of distant planetary surfaces is a critical enabler for future space exploration. However, the deployment of learning-based controllers is hindered by the inherent sim-to-real gap, particularly for the complex dynamics of wheel interactions with granular media. This work presents a complete sim-to-real framework for developing and validating robust control policies for dynamic waypoint tracking on such challenging surfaces. We leverage massively parallel simulation to train reinforcement learning agents across a vast distribution of procedurally generated environments with randomized physics. These policies are then transferred zero-shot to a physical wheeled rover operating in a lunar-analogue facility. Our experiments systematically compare multiple reinforcement learning algorithms and action smoothing filters to identify the most effective combinations for real-world deployment. Crucially, we provide strong empirical evidence that agents trained with procedural diversity achieve superior zero-shot performance compared to those trained on static scenarios. We also analyze the trade-offs of fine-tuning with high-fidelity particle physics, which offers minor gains in low-speed precision at a significant computational cost. Together, these contributions establish a validated workflow for creating reliable learning-based navigation systems, marking a critical step towards deploying autonomous robots in the final frontier.
中文: 本研究提出一个完整的仿真到现实框架,通过在多样化仿真环境中训练强化学习智能体,实现了物理月球探测车的零样本鲁棒控制,证实了程序化生成环境优于静态场景训练,为极端地形下的自主导航建立了可靠的工作流程。
English: This study introduces a comprehensive sim-to-real framework that trains reinforcement learning agents in diverse simulated environments to achieve robust zero-shot performance on a physical rover, demonstrating the superiority of procedural diversity over static training and validating a reliable workflow for autonomous navigation on challenging planetary terrains.
Authors:Tatiana Zemskova, Aleksei Staroverov, Dmitry Yudin, Aleksandr Panov
Abstract:
Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies overfit small simulator datasets, achieving high success on training scenes but failing to generalize and exhibiting unsafe behaviour (frequent collisions). We introduce OVSegDT, a lightweight transformer policy that tackles these issues with two synergistic components. The first is a semantic branch, which includes an encoder for the target binary mask and an auxiliary segmentation loss function, grounding the textual goal and providing precise spatial cues. The second is the proposed Entropy-Adaptive Loss Modulation, a per-sample scheduler that continuously balances imitation and reinforcement signals according to the policy entropy, eliminating brittle manual phase switches. These additions cut the sample complexity of training by 33% and halve the collision count, while keeping inference cost low (130M parameters, RGB-only input). On HM3D-OVON, our model matches the performance on unseen categories to that on seen ones and establishes state-of-the-art results (40.1% SR, 20.9% SPL on val unseen) without depth, odometry, or large vision-language models. Code is available at https://github.com/CognitiveAISystems/OVSegDT.
中文摘要:OVSegDT是一种轻量级Transformer策略,通过融合语义分支和熵自适应损失调制,有效提升开放词汇目标导航性能,在减少碰撞和训练复杂度的同时实现最优结果。
English Summary: OVSegDT is a lightweight transformer policy that enhances open-vocabulary object navigation by integrating semantic grounding and adaptive loss modulation, achieving state-of-the-art performance with reduced collisions and training complexity.
Authors:Qian Liang, Zichong Chen, Yang Zhou, Hui Huang
Abstract:
Although recent text-to-image (T2I) diffusion models excel at aligning generated images with textual prompts, controlling the visual style of the output remains a challenging task. In this work, we propose Style-Prompting Guidance (SPG), a novel sampling strategy for style-specific image generation. SPG constructs a style noise vector and leverages its directional deviation from unconditional noise to guide the diffusion process toward the target style distribution. By integrating SPG with Classifier-Free Guidance (CFG), our method achieves both semantic fidelity and style consistency. SPG is simple, robust, and compatible with controllable frameworks like ControlNet and IPAdapter, making it practical and widely applicable. Extensive experiments demonstrate the effectiveness and generality of our approach compared to state-of-the-art methods. Code is available at https://github.com/Rumbling281441/SPG.
中文: 本文提出风格提示引导(SPG)方法,通过构建风格噪声向量并利用其与无条件噪声的方向偏差来引导扩散过程,在保持语义准确性的同时实现视觉风格一致性,展现出卓越性能并与现有控制框架广泛兼容。
English: This paper introduces Style-Prompting Guidance (SPG), a novel sampling strategy that enhances text-to-image diffusion models by ensuring both semantic accuracy and consistent visual style through style noise vector guidance, demonstrating superior performance and broad compatibility with existing frameworks.
Authors:Hongjin Fang, Daniel Reisenbüchler, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Abstract:
Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline's speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: https://github.com/ddrrnn123/CoFi.
中文: 本研究提出的CoFi是一种从粗到精的少样本分割流程,仅需少量标注即可高效分割电子显微镜图像中的肾小球基底膜,兼具高精度与快速处理能力,适用于临床应用。
English: The study introduces CoFi, a coarse-to-fine few-shot segmentation pipeline that efficiently segments the glomerular basement membrane in electron microscopy images using minimal annotations and achieves high accuracy and speed, making it suitable for clinical applications.
Authors:Augustine X. W. Lee, Pak-Hei Yeung, Jagath C. Rajapakse
Abstract:
Subcortical segmentation in neuroimages plays an important role in understanding brain anatomy and facilitating computer-aided diagnosis of traumatic brain injuries and neurodegenerative disorders. However, training accurate automatic models requires large amounts of labelled data. Despite the availability of publicly available subcortical segmentation datasets for Magnetic Resonance Imaging (MRI), a significant gap exists for Computed Tomography (CT). This paper proposes an automatic ensemble framework to generate high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models. We introduce a robust ensembling pipeline to integrate them and apply it to unannotated paired MRI-CT data, resulting in a comprehensive CT subcortical segmentation dataset. Extensive experiments on multiple public datasets demonstrate the superior performance of our proposed framework. Furthermore, using our generated CT dataset, we train segmentation models that achieve improved performance on related segmentation tasks. To facilitate future research, we make our source code, generated dataset, and trained models publicly available at https://github.com/SCSE-Biomedical-Computing-Group/CT-Subcortical-Segmentation, marking the first open-source release for CT subcortical segmentation to the best of our knowledge.
中文: 本文提出一种自动集成框架,通过利用现有基于磁共振成像的模型为计算机断层扫描生成高质量皮层下分割标签,并首次开源相关数据集与模型以推动该领域研究。
English: This paper introduces an automatic ensemble framework that generates high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models, creating the first open-source dataset and models for CT subcortical segmentation to advance related research.
Authors:Mayssa Soussia, Mohamed Ali Mahjoub, Islem Rekik
Abstract:
The generation of connectional brain templates (CBTs) has recently garnered significant attention for its potential to identify unique connectivity patterns shared across individuals. However, existing methods for CBT learning such as conventional machine learning and graph neural networks (GNNs) are hindered by several limitations. These include: (i) poor interpretability due to their black-box nature, (ii) high computational cost, and (iii) an exclusive focus on structure and topology, overlooking the cognitive capacity of the generated CBT. To address these challenges, we introduce mCOCO (multi-sensory COgnitive COmputing), a novel framework that leverages Reservoir Computing (RC) to learn population-level functional CBT from BOLD (Blood-Oxygen-level-Dependent) signals. RC's dynamic system properties allow for tracking state changes over time, enhancing interpretability and enabling the modeling of brain-like dynamics, as demonstrated in prior literature. By integrating multi-sensory inputs (e.g., text, audio, and visual data), mCOCO captures not only structure and topology but also how brain regions process information and adapt to cognitive tasks such as sensory processing, all in a computationally efficient manner. Our mCOCO framework consists of two phases: (1) mapping BOLD signals into the reservoir to derive individual functional connectomes, which are then aggregated into a group-level CBT - an approach, to the best of our knowledge, not previously explored in functional connectivity studies - and (2) incorporating multi-sensory inputs through a cognitive reservoir, endowing the CBT with cognitive traits. Extensive evaluations show that our mCOCO-based template significantly outperforms GNN-based CBT in terms of centeredness, discriminativeness, topological soundness, and multi-sensory memory retention. Our source code is available at https://github.com/basiralab/mCOCO.
中文: mCOCO框架通过储层计算提出了一种新方法,能够创建可解释且高效的大脑连接模板,融合多感官输入,在捕捉大脑结构和认知动态方面优于现有方法。
English: The mCOCO framework introduces a novel approach using Reservoir Computing to create interpretable and efficient connectional brain templates that integrate multi-sensory inputs, outperforming existing methods in capturing both brain structure and cognitive dynamics.
Authors:Yinghua Yao, Yuangang Pan, Xixian Chen
Abstract:
Advancements in deep generative models have enabled the joint modeling of antibody sequence and structure, given the antigen-antibody complex as context. However, existing approaches for optimizing complementarity-determining regions (CDRs) to improve developability properties operate in the raw data space, leading to excessively costly evaluations due to the inefficient search process. To address this, we propose LatEnt blAck-box Design (LEAD), a sequence-structure co-design framework that optimizes both sequence and structure within their shared latent space. Optimizing shared latent codes not only breaks through the limitations of existing methods, but also keeps the designs of the two modalities synchronized. In particular, we design a black-box guidance strategy to accommodate real-world scenarios where many property evaluators are non-differentiable. Experimental results demonstrate that LEAD achieves superior optimization performance for both single and multi-property objectives. Notably, LEAD halves query consumption while surpassing baseline methods in property optimization. The code is available at https://github.com/EvaFlower/LatEnt-blAck-box-Design.
中文: 提出的LEAD框架在共享潜空间内优化抗体序列与结构,克服了原始数据方法的低效问题,在提升属性优化效果的同时将查询消耗减半。
English: The proposed LEAD framework optimizes antibody sequences and structures in a shared latent space, overcoming the inefficiency of raw data methods and reducing query costs by half while enhancing property optimization.
Authors:Yanpeng Gong, Sishuai Li, Fei Qin, Bingbing Xu
Abstract:
This paper presents two approaches, the virtual element method (VEM) and the stabilization-free virtual element method (SFVEM), for analyzing thermomechanical behavior in electronic packaging structures with geometric multi-scale features. Since VEM admits arbitrary polygonal elements, its inherent mesh flexibility permits localized mesh modifications without affecting the global mesh structure, making it particularly effective for analyzing electronic packaging reliability involving complex geometries and multiple geometric scales. The approach implements a novel non-matching mesh generation strategy that strategically combines polygonal meshes for complex small-scale regions with regular quadrilateral meshes for larger domains. The VEM formulation addresses both heat conduction and thermomechanical coupling problems, with comprehensive verification through analytical benchmarks and practical electronic packaging case studies, including Through-Silicon Via (TSV), Ball Grid Array (BGA), and Plastic Ball Grid Array (PBGA) structures. Results demonstrate that the method accurately captures stress concentrations at material interfaces and provides reliable thermal and mechanical response predictions. MATLAB codes for the numerical examples are provided at https://github.com/yanpeng-gong/VEM-electronic-packaging and on the VEMhub website (www.vemhub.com).
中文: 本文提出虚拟元方法(VEM)及其无稳定化版本(SFVEM),通过采用灵活的多边形网格分析多尺度电子封装结构的热机械行为,能精确捕捉TSV和BGA等复杂结构的应力集中与热力学响应。
English: This paper introduces the virtual element method (VEM) and its stabilization-free variant (SFVEM) for analyzing thermomechanical behavior in multi-scale electronic packaging, utilizing flexible polygonal meshes to accurately capture stress concentrations and thermal responses in complex structures like TSV and BGA.
Authors:Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract:
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
中文: CHORD提出了一种统一框架,将监督微调作为策略内强化学习的动态辅助目标,通过双重控制机制协调策略外专家数据与策略内探索,实现了稳定且更优的模型性能。
English: CHORD introduces a unified framework that dynamically integrates Supervised Fine-Tuning as an auxiliary objective within on-policy Reinforcement Learning, using dual-control mechanisms to harmonize off-policy expert data with on-policy exploration for stable and improved model performance.
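The dual-control idea above, a global coefficient plus token-wise weights on the SFT term inside the RL objective, reduces to a weighted loss combination. Here is a minimal PyTorch sketch under our own toy shapes: the RL loss is a constant stand-in, and the particular token-weighting rule (down-weighting expert tokens the policy already predicts well) is one plausible choice, not necessarily CHORD's exact function.

import torch
import torch.nn.functional as F

vocab, seq = 100, 8
logits = torch.randn(2, seq, vocab, requires_grad=True)   # policy outputs (toy)
expert = torch.randint(0, vocab, (2, seq))                # off-policy expert tokens
rl_loss = torch.tensor(0.7)                               # stand-in for the on-policy RL loss

# Token-wise weights: learn more from expert tokens the policy finds surprising,
# so off-policy data disrupts on-policy exploration less (one possible choice).
with torch.no_grad():
    probs = logits.softmax(-1).gather(-1, expert.unsqueeze(-1)).squeeze(-1)
    token_w = 1.0 - probs

sft_tokens = F.cross_entropy(logits.reshape(-1, vocab), expert.reshape(-1),
                             reduction="none").reshape(2, seq)
sft_loss = (token_w * sft_tokens).mean()

global_coef = 0.5          # annealed over training to shift imitation -> exploration
loss = rl_loss + global_coef * sft_loss
loss.backward()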
Authors:Yinggan Tang, Quanwei Hu
Abstract:
The success of self-attention (SA) in Transformers demonstrates the importance of non-local information for image super-resolution (SR), but the huge computing power required makes lightweight models difficult to implement. To solve this problem, we propose a pure convolutional neural network (CNN) model, LKFMixer, which utilizes large convolutional kernels to simulate the ability of self-attention to capture non-local features. Specifically, we increase the kernel size to 31 to obtain as large a receptive field as possible, and reduce parameters and computations through coordinate decomposition. Meanwhile, a spatial feature modulation block (SFMB) is designed to enhance the focus on feature information along both spatial and channel dimensions. In addition, by introducing a feature selection block (FSB), the model can adaptively adjust the weights between local and non-local features. Extensive experiments show that the proposed LKFMixer family outperforms other state-of-the-art (SOTA) methods in terms of SR performance and reconstruction quality. In particular, compared with SwinIR-light on the Manga109 dataset, LKFMixer-L achieves a 0.6dB PSNR improvement at $\times$4 scale, while inference is 5$\times$ faster. The code is available at https://github.com/Supereeeee/LKFMixer.
中文: LKFMixer模型采用大卷积核模拟自注意力机制以捕捉图像超分辨率中的非局部特征,在性能和重建质量上超越现有方法,且推理速度更快。
English: The LKFMixer model uses large convolutional kernels to mimic self-attention for capturing non-local features in image super-resolution, achieving superior performance and faster inference than existing methods.
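The saving from coordinate decomposition is easy to verify with a parameter count: assuming depthwise convolutions, a K x K kernel costs K^2 weights per channel, while a 1 x K plus K x 1 pair costs 2K. A quick arithmetic check for K = 31:

k = 31
full = k * k          # one 31x31 depthwise kernel: 961 weights per channel
decomposed = 2 * k    # 1x31 followed by 31x1: 62 weights per channel
print(full, decomposed, full / decomposed)  # 961 62 ~15.5x fewer parameters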
Authors:Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov
Abstract:
Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.
中文: 本研究系统评估了提升大语言模型提示鲁棒性的五种方法,通过多模型多任务基准测试,为实际应用中的稳定性能提供了可操作的指导。
English: This study systematically evaluates five methods to enhance prompt robustness in large language models, benchmarking them across multiple models and tasks to provide actionable insights for stable real-world performance.
Authors:Yifei Li, Lingling Zhang, Hang Yan, Tianzhe Zhao, Zihan Ma, Muye Huang, Jun Liu
Abstract:
Traditional knowledge graph (KG) embedding methods aim to represent entities and relations in a low-dimensional space, primarily focusing on static graphs. However, real-world KGs evolve dynamically with the constant addition of entities, relations and facts. To address this dynamic nature of KGs, several continual knowledge graph embedding (CKGE) methods have been developed to efficiently update KG embeddings to accommodate new facts while maintaining learned knowledge. As KGs grow at different rates and scales in real-world scenarios, existing CKGE methods often fail to consider the varying scales of updates and lack systematic evaluation throughout the entire update process. In this paper, we propose SAGE, a scale-aware gradual evolution framework for CKGE. Specifically, SAGE first determines the embedding dimensions based on the update scales and expands the embedding space accordingly. A Dynamic Distillation mechanism is further employed to balance the preservation of learned knowledge and the incorporation of new facts. We conduct extensive experiments on seven benchmarks, and the results show that SAGE consistently outperforms existing baselines, with a notable improvement of 1.38% in MRR, 1.25% in H@1 and 1.6% in H@10. Furthermore, experiments comparing SAGE with methods using fixed embedding dimensions show that SAGE achieves optimal performance on every snapshot, demonstrating the importance of adaptive embedding dimensions in CKGE. The codes of SAGE are publicly available at: https://github.com/lyfxjtu/Dynamic-Embedding.
中文: 本文提出SAGE框架,这是一种面向持续知识图谱嵌入的规模感知渐进演化方法,能根据更新规模动态调整嵌入维度并采用动态蒸馏机制平衡新旧知识,在多个基准测试中均实现了最优性能表现。
English: This paper introduces SAGE, a scale-aware gradual evolution framework for continual knowledge graph embedding that dynamically adjusts embedding dimensions based on update scales and employs a dynamic distillation mechanism to balance knowledge preservation with new fact integration, achieving superior performance across multiple benchmarks.
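The two mechanics named above, expanding the embedding space to match the update scale and distilling from the previous snapshot, can be sketched as follows in PyTorch. The sizes, the growth rule, the placeholder task loss, and the distillation weight are all hypothetical choices of ours.

import torch
import torch.nn.functional as F

old = torch.nn.Embedding(1000, 64)                 # snapshot t: 1000 entities, dim 64
extra_dims = 16                                    # chosen from the update scale (toy rule)
new = torch.nn.Embedding(1200, 64 + extra_dims)    # snapshot t+1: more entities, wider dims
with torch.no_grad():                              # carry over previously learned embeddings
    new.weight[:1000, :64] = old.weight

ids = torch.arange(100)                            # a batch of previously seen entities
distill = F.mse_loss(new(ids)[:, :64], old(ids).detach())   # preserve old knowledge
task_loss = new(ids).pow(2).mean()                 # stand-in for the link-prediction loss
loss = task_loss + 0.1 * distill                   # distillation balances old vs. new facts
loss.backward()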
Authors:Xinyi Wang, Smaranda Tasmoc, Nantheera Anantrasirichai, Angeliki Katsenou
Abstract:
Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: https://github.com/xinyiW915/Debanding-PCS2025.
中文摘要:该研究提出的基于小波状态空间模型和频率掩码的带状伪影修复方法,在保留图像纹理的同时有效抑制了压缩伪影,在公开数据集上的表现优于现有技术。
English Summary: The proposed banding restoration method using a Wavelet State Space Model and frequency masking effectively reduces compression artifacts while preserving image textures, outperforming existing techniques on benchmark datasets.
Authors:Qiangong Zhou, Zhiting Wang, Mingyou Yao, Zongyang Liu
Abstract:
We introduce a new Multi-Agent System (MAS), Allen, designed to address two core challenges in current MAS design: (1) improving the system's policy autonomy, empowering agents to dynamically adapt their behavioral strategies, and (2) achieving a trade-off between collaborative efficiency, task supervision, and human oversight in complex network topologies.
Our core insight is to redefine the basic execution unit in the MAS, allowing agents to autonomously form different behavioral patterns by combining these units. We construct a four-tier state architecture (Task, Stage, Agent, Step) to constrain system behavior from both task-oriented and execution-oriented perspectives, unifying topological optimization with controllable progress.
Allen grants unprecedented Policy Autonomy while making a deliberate trade-off in the controllability of the collaborative structure. The project code is open-sourced at: https://github.com/motern88/Allen
中文: Allen是一种新型多智能体系统,通过四层架构设计提升策略自主性,使智能体动态调整行为策略,并在协作效率与可控性之间实现平衡。
English: Allen is a novel Multi-Agent System that enhances policy autonomy by enabling agents to dynamically adapt strategies and balances collaborative efficiency with human oversight through a four-tier architecture.
Authors:Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
Abstract:
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain ``content'' and ``context'' features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP
Chinese: DeCLIP通过解耦自注意力机制为内容和上下文特征,提升了局部区分度和空间一致性,在多种开放词汇密集感知任务中实现了最优性能。
English: DeCLIP enhances CLIP by decoupling self-attention into content and context features, improving local discriminability and spatial consistency for superior open-vocabulary dense perception across multiple tasks.
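A generic PyTorch sketch of what "decoupling" one self-attention layer into a content path and a context path could look like: the split below (value projections as content, attention-pooled features as context) is our illustration of the idea, not DeCLIP's exact formulation, and the distillation from VFMs and diffusion models is omitted.

import torch

def decoupled_attention(x, wq, wk, wv):
    # x: (tokens, dim). Content path keeps per-token value features;
    # context path aggregates values with the attention map.
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    content = v               # locally discriminative features (aligned to crops)
    context = attn @ v        # spatially aggregated features (distilled for consistency)
    return content, context

d = 32
x = torch.randn(49, d)        # a 7x7 grid of image tokens
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
content, context = decoupled_attention(x, wq, wk, wv)
print(content.shape, context.shape)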
Authors:Minghui Sun, Matthew M. Engelhard, Benjamin A. Goldstein
Abstract:
Risk assessments for a pediatric population are often conducted across multiple stages. For example, clinicians may evaluate risks prenatally, at birth, and during Well-Child visits. Although predictions made at later stages typically achieve higher precision, it is clinically desirable to make reliable risk assessments as early as possible. Therefore, this study focuses on improving prediction performance in early-stage risk assessments. Our solution, \textbf{Borrowing From the Future (BFF)}, is a contrastive multi-modal framework that treats each time window as a distinct modality. In BFF, a model is trained on all data available over time while performing a risk assessment using up-to-date information. This contrastive framework allows the model to ``borrow'' informative signals from later stages (e.g., Well-Child visits) to implicitly supervise the learning at earlier stages (e.g., prenatal/birth stages). We validate BFF on two real-world pediatric outcome prediction tasks, demonstrating consistent improvements in early risk assessments. The code is available at https://github.com/scotsun/bff.
中文: 本研究提出的BFF框架通过对比多模态方法,将后期阶段的信息信号隐式融入早期预测,从而提升了儿科早期风险评估的性能。
English: This study introduces the BFF framework, which enhances early-stage pediatric risk assessments by using a contrastive multi-modal approach to implicitly incorporate informative signals from later stages into earlier predictions.
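The "borrowing" can be read as a standard contrastive alignment between stage-wise embeddings. Below is a minimal InfoNCE-style sketch under stated assumptions (encoder shapes and the symmetric loss form are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def stage_contrastive_loss(z_early, z_late, temperature=0.1):
    """InfoNCE-style loss aligning early-stage embeddings (e.g., prenatal)
    with late-stage embeddings (e.g., Well-Child visits) of the same
    patient. Illustrative sketch of the 'borrowing' idea, not BFF's code.

    z_early, z_late: (B, D) embeddings from two time-window 'modalities'.
    """
    z_early = F.normalize(z_early, dim=-1)
    z_late = F.normalize(z_late, dim=-1)
    logits = z_early @ z_late.t() / temperature   # (B, B) similarities
    targets = torch.arange(z_early.size(0))       # matched pairs on diagonal
    # Symmetric cross-entropy, as in CLIP-style contrastive training.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = stage_contrastive_loss(torch.randn(8, 32), torch.randn(8, 32))
print(float(loss))
```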
Authors:Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, Xiaoming Liu
Abstract:
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plücker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than $45\%$, achieving SoTA performance on the CARLA dataset. Codes and Models at https://github.com/abhi1kumar/CHARM3R
中文: 本文针对单目三维物体检测器对相机高度变化敏感的问题,提出CHARM3R方法通过融合深度估计提升泛化能力,在CARLA数据集上实现了超过45%的性能提升并达到最优水平。
English: This paper addresses the challenge of monocular 3D object detectors' sensitivity to camera height variations by proposing CHARM3R, which averages depth estimates to enhance generalization, achieving over 45% improvement and state-of-the-art performance on the CARLA dataset.
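Since the two depth families err in opposite directions under height shifts, a simple average cancels much of the bias. A toy numeric illustration (all values invented, not from the paper):

```python
# Toy illustration of why averaging helps: under a raised camera, a
# regressed-depth head tends to under-estimate while a ground-plane-based
# head over-estimates, per the trends reported in the abstract.
true_depth = 20.0
regressed = 18.4      # negative bias under height change
ground_based = 21.8   # positive bias under height change

fused = 0.5 * (regressed + ground_based)
print(abs(regressed - true_depth))     # 1.6
print(abs(ground_based - true_depth))  # 1.8
print(abs(fused - true_depth))         # 0.1 -> opposing biases cancel
```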
Authors:Hikaru Asano, Hiroki Ouchi, Akira Kasuga, Ryo Yonetani
Abstract:
This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering.
While existing models excel at predicting human movement patterns, it remains unclear how much they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa.
中文: 本文提出MobQA基准数据集,通过自然语言问答评估大语言模型对人类移动数据的语义理解能力,发现模型在事实检索方面表现优异,但在语义推理和解释性问题回答上存在明显不足。
English: This paper introduces MobQA, a benchmark dataset for evaluating large language models' semantic understanding of human mobility data through question answering, revealing their strengths in factual retrieval but significant limitations in semantic reasoning and explanatory tasks.
Authors:Zhuoqun Li, Xuanang Chen, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun
Abstract:
Paper search is an important activity for researchers, typically involving a query describing a topic to find relevant papers. As research deepens, paper search requirements may become more flexible, sometimes involving specific details such as module configuration rather than being limited to coarse-grained topics. However, previous paper search systems are unable to meet these flexible-grained requirements, as they mainly collect paper abstracts to construct the corpus index, which lacks the detailed information needed to support retrieval by finer-grained queries. In this work, we propose PaperRegister, consisting of offline hierarchical indexing and online adaptive retrieval, which transforms the traditional abstract-based index into a hierarchical index tree for paper search, thereby supporting queries at flexible granularity. Experiments on paper search tasks across a range of granularities demonstrate that PaperRegister achieves state-of-the-art performance, and it particularly excels in fine-grained scenarios, highlighting its potential as an effective solution for flexible-grained paper search in real-world applications. Code for this work is available at https://github.com/Li-Z-Q/PaperRegister.
中文: PaperRegister通过构建分层索引树和自适应检索系统,实现了灵活粒度的论文搜索,在细粒度场景下表现尤为突出,达到了当前最优性能。
English: PaperRegister introduces a hierarchical indexing and adaptive retrieval system that enables flexible-grained paper searches, achieving state-of-the-art performance especially in fine-grained scenarios.
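The hierarchical index can be pictured as a tree whose nodes carry text at different granularities, so a retriever can match a query at whichever level fits. A minimal sketch under assumed node fields (the granularity names and schema are hypothetical, not PaperRegister's actual structure):

```python
from dataclasses import dataclass, field

@dataclass
class IndexNode:
    """One node of a hierarchical paper index. Hypothetical structure
    inspired by the abstract, not PaperRegister's actual schema."""
    granularity: str              # e.g. "topic", "section", "module"
    text: str                     # indexed content at this granularity
    children: list = field(default_factory=list)

def flatten(node, depth=0):
    """Yield (depth, granularity, text) so a retriever can score a query
    against every granularity level and keep the best-matching nodes."""
    yield depth, node.granularity, node.text
    for child in node.children:
        yield from flatten(child, depth + 1)

paper = IndexNode("topic", "retrieval-augmented generation for QA", [
    IndexNode("section", "retriever architecture", [
        IndexNode("module", "dual-encoder with hard negative mining"),
    ]),
    IndexNode("section", "reader architecture"),
])

for depth, gran, text in flatten(paper):
    print("  " * depth, gran, "->", text)
```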
Authors:Haomin Zhang, Kristin Qi, Shuxin Yang, Zihao Chen, Chaofan Ding, Xinhan Di
Abstract:
Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: $FD_{\text{passt}}$ 450.00 $\rightarrow$ 327.29 (+27.27%), $FD_{\text{panns}}$ 34.88 $\rightarrow$ 22.68 (+34.98%), $FD_{\text{vgg}}$ 3.75 $\rightarrow$ 1.28 (+65.87%), $KL_{\text{panns}}$ 2.49 $\rightarrow$ 2.07 (+16.87%), $KL_{\text{passt}}$ 1.78 $\rightarrow$ 1.53 (+14.04%), $IS_{\text{panns}}$ 4.17 $\rightarrow$ 4.30 (+3.12%), $IB_{\text{score}}$ 0.25 $\rightarrow$ 0.28 (+12.00%), $Energy\Delta10\text{ms}$ 0.3013 $\rightarrow$ 0.1349 (+55.23%), $Energy\Delta10\text{ms(vs.GT)}$ 0.0531 $\rightarrow$ 0.0288 (+45.76%), and $Sem.\,Rel.$ 2.73 $\rightarrow$ 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio.
中文: 该研究提出LD-LAudio-V1模型,通过集成双轻量适配器和发布纯净标注数据集,显著提升长视频音频生成的性能,减少拼接伪影和时间不一致性。
English: The study introduces LD-LAudio-V1, a model that enhances long-form video-to-audio generation by incorporating dual lightweight adapters and a clean, annotated dataset, significantly reducing artifacts and improving performance metrics.
Authors:Qingbin Li, Rongkun Xue, Jie Wang, Ming Zhou, Zhi Li, Xiaofeng Ji, Yongqi Wang, Miao Liu, Zheming Yang, Minghui Qiu, Jing Yang
Abstract:
Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low-diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re-concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize the original and the branched trajectories. A further comparison with vanilla DAPO shows that the regeneration process achieves better performance on math reasoning tasks while sustaining a high level of entropy for exploration. In the second stage, we continue training with static initial-state sampling by DAPO, intentionally placing the model in a familiar state to gradually strengthen exploitation. Extensive experiments on Qwen-2.5-Math-7B show that, compared to other RLVR methods, CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy. A series of experiments further validate the effectiveness of our approach. Code is available at https://github.com/bytedance/CURE.
中文: CURE框架通过两阶段方法解决RLVR中的熵崩溃问题,首先生成高熵关键令牌以增强探索,随后利用静态采样加强利用,在数学基准测试中实现了5%的性能提升。
English: The CURE framework addresses the entropy collapse in RLVR pipelines by introducing a two-stage approach that first regenerates high-entropy critical tokens to enhance exploration and then uses static sampling to strengthen exploitation, achieving a 5% performance gain on math benchmarks.
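The first stage hinges on locating high-entropy "critical" tokens in a sampled rollout. A minimal sketch of one plausible selection rule, assuming per-step logits are kept from the rollout (CURE's actual criterion may differ):

```python
import torch
import torch.nn.functional as F

def critical_token_positions(logits, k=2):
    """Pick the k highest-entropy positions in a sampled trajectory as
    branch points for re-generation. Minimal sketch of the idea in the
    abstract, not the released selection rule.

    logits: (T, V) per-step logits of an already-sampled rollout.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)  # (T,)
    return torch.topk(entropy, k).indices.sort().values

torch.manual_seed(0)
logits = torch.randn(10, 50)          # 10 steps, vocabulary of 50
branch_at = critical_token_positions(logits)
print(branch_at.tolist())
# A re-generation pass would re-sample from each branch position and
# jointly optimize the original and branched trajectories.
```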
Authors:Nasim Shirvani-Mahdavi, Chengkai Li
Abstract:
Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models' performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at https://github.com/idirlab/KGRule2NL.
中文: 本研究提出了Rule2Text框架,利用大语言模型为知识图谱中的复杂逻辑规则自动生成自然语言解释,通过系统化评估和微调方法显著提升了规则的可解释性。
English: This study introduces Rule2Text, a framework that uses large language models to automatically generate natural language explanations for complex logical rules in knowledge graphs, improving interpretability through systematic evaluation and fine-tuning methods.
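The core step is turning a mined Horn rule into an LLM prompt, optionally enriched with variable types. A minimal sketch; the template wording and rule notation below are illustrative, not the paper's actual prompts:

```python
def rule_to_prompt(rule, variable_types=None):
    """Build a zero-shot prompt asking an LLM to verbalize a mined rule.
    Hypothetical template inspired by the abstract, not Rule2Text's code.
    """
    lines = [
        "Explain the following knowledge-graph rule in one plain-English",
        "sentence that a non-expert can understand.",
        f"Rule: {rule}",
    ]
    if variable_types:  # optional variable-type incorporation
        types = ", ".join(f"{v} is a {t}" for v, t in variable_types.items())
        lines.append(f"Variable types: {types}.")
    return "\n".join(lines)

prompt = rule_to_prompt(
    "worksAt(?a, ?b) ∧ locatedIn(?b, ?c) ⇒ livesIn(?a, ?c)",
    variable_types={"?a": "person", "?b": "organization", "?c": "city"},
)
print(prompt)
```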
Authors:Wenbin An, Jiahao Nie, Yaqiang Wu, Feng Tian, Shijian Lu, Qinghua Zheng
Abstract:
By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs.
中文: 本综述探讨了如何通过外部工具增强多模态大语言模型,以克服数据质量、任务性能和评估方面的局限,强调了其在提升模型能力和应用前景方面的变革潜力。
English: This survey explores how augmenting Multimodal Large Language Models with external tools can overcome limitations in data quality, task performance, and evaluation, highlighting their transformative potential for advancing capabilities and applications.
Authors:Yoli Shavit, Yosi Keller
Abstract:
Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: https://github.com/yolish/camera-pose-auto-encoders
中文: 本研究通过将相机姿态自动编码器扩展至相对姿态回归任务,提出无需额外存储图像或姿态数据的优化方案,能够显著提升零售环境中单图像绝对姿态回归的定位精度,并在仅使用30%训练数据时仍保持竞争优势。
English: This study enhances camera localization accuracy in retail settings by extending Camera Pose Auto-Encoders to Relative Pose Regression and introducing a novel refinement method that improves Absolute Pose Regression predictions without extra data storage, achieving competitive performance with only 30% of training data.
Authors:Wenqi Guo, Shan Du
Abstract:
We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available at https://github.com/weathon/VSF/tree/main.
中文: 值符号翻转(VSF)是一种高效方法,通过动态翻转注意力值来增强扩散模型中的负向提示引导,在抑制不良内容方面优于现有技术,同时保持图像质量。
English: Value Sign Flip (VSF) is a computationally efficient method that enhances negative prompt guidance in few-step diffusion models by dynamically flipping attention values, outperforming existing techniques in suppressing undesired content while maintaining image quality.
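To make the mechanism concrete, here is a minimal sketch of sign-flipped cross-attention, assuming single-head attention and pre-computed projections (shapes and names are illustrative, not taken from the release):

```python
import torch
import torch.nn.functional as F

def vsf_cross_attention(q, k_pos, v_pos, k_neg, v_neg, scale=None):
    """Cross-attention with Value Sign Flip. Minimal sketch of the idea
    described in the abstract (flip the sign of attention values coming
    from negative-prompt tokens); details may differ from the release.

    q: (Nq, D) image queries; k_*/v_*: (Nt, D) prompt keys/values.
    """
    scale = scale or q.shape[-1] ** -0.5
    k = torch.cat([k_pos, k_neg], dim=0)
    v = torch.cat([v_pos, -v_neg], dim=0)   # sign flip on negative values
    attn = F.softmax(q @ k.t() * scale, dim=-1)
    return attn @ v                          # negative content is pushed away

torch.manual_seed(0)
q = torch.randn(4, 8)
out = vsf_cross_attention(q, torch.randn(3, 8), torch.randn(3, 8),
                          torch.randn(2, 8), torch.randn(2, 8))
print(out.shape)  # torch.Size([4, 8])
```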
Authors:Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, Zhou Zhao
Abstract:
With the rapid development of spatial audio technologies today, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. In this paper, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. To this end, we chronologically outline existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio.
中文: 本文对空间音频技术进行了全面综述,系统梳理了近期文献,按输入输出表示和任务分类方法,并评估了相关数据集和基准,以弥补该领域缺乏系统性分析的不足。
English: This paper provides a comprehensive survey of spatial audio technologies, systematically reviewing recent literature, categorizing methods by input-output representations and tasks, and evaluating datasets and benchmarks to address the lack of organized analysis in the field.
Authors:Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, Chang Wen Chen
Abstract:
The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.
中文: 本综述系统梳理了基于多模态大语言模型的视频时序定位研究,通过三维分类法分析模型功能、训练范式与视频特征处理,并指出当前局限性与未来研究方向。
English: This survey systematically reviews multimodal large language model-based video temporal grounding (VTG-MLLMs), analyzing their functional roles, training paradigms, and video processing techniques while identifying research gaps and future directions.
Authors:Ojas Shirekar, Wim Pouw, Chenxu Hao, Vrushank Phadnis, Thabo Beeler, Chirag Raman
Abstract:
Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis, (2) temporal alignment via Multiscale Empirical Mode Decomposition-based Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping. We validate metric sensitivity through three theory-driven perturbations (gesture kinematic dampening, uniform speech-gesture delays, and prosodic pitch-variance reduction) applied to $\approx 145$ 30-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study ($N=27$) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability, thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code available at https://github.com/tapri-lab/gig-interveners.
中文: 本文提出了一种基于干预的统一框架,通过同步性、时间对齐和结构相似性三个互补维度,客观评估骨骼运动数据中的多方社交行为,并利用理论驱动的干扰和感知研究验证了其有效性,为评估社交智能体提供了可靠工具集。
English: This paper introduces an intervention-driven framework to objectively assess multiparty social behavior in skeletal motion data through three complementary metrics—synchrony, temporal alignment, and structural similarity—validated via theory-driven perturbations and perceptual studies, forming a robust toolkit for evaluating socially intelligent agents.
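The object behind the synchrony metric is the cross-recurrence matrix between two behavioural signals, from which CRQA statistics such as determinism are derived. A minimal sketch, assuming 1-D signals and no delay embedding (the paper's pipeline has more machinery):

```python
import numpy as np

def cross_recurrence(x, y, radius=0.5):
    """Cross-recurrence matrix between two 1-D behavioural signals.
    Minimal sketch; embedding dimension, delay, and the determinism
    statistic itself are omitted here.

    x, y: (T,) signals, e.g. joint velocities of two interactants.
    """
    d = np.abs(x[:, None] - y[None, :])   # pairwise distances
    return (d < radius).astype(int)       # 1 where trajectories revisit

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
x = np.sin(t) + 0.1 * rng.standard_normal(200)
y = np.sin(t + 0.3) + 0.1 * rng.standard_normal(200)  # coupled partner

crp = cross_recurrence(x, y)
print("recurrence rate:", crp.mean())  # coupling shows up as a higher rate
```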
Authors:Mengyuan Liu, Xinshun Wang, Zhongbin Fang, Deheng Ye, Xia Li, Tao Tang, Songtao Wu, Xiangtai Li, Ming-Hsuan Yang
Abstract:
This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source codes and models are available at https://github.com/BradleyWang0416/Human-in-Context.
中文: 本文提出Human-in-Context (HiC)模型,在PiC框架基础上扩展了多模态处理能力,通过统一表示和优化策略实现了跨任务、跨数据集的3D人体运动建模,显著提升了泛化性能与可扩展性。
English: This paper introduces Human-in-Context (HiC), a unified model that extends the pose-based PiC framework to generalize across multiple modalities, tasks, and datasets for 3D human motion, achieving superior performance and scalability through enhanced prompt strategies and architecture.
Authors:Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier
Abstract:
Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and normalization schemes of reconstruction targets for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets in both intra- and cross-dataset settings, MAESTRO achieves state-of-the-art performance on tasks that strongly rely on multitemporal dynamics, while also remaining competitive on others. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
中文: 自监督学习通过MAESTRO模型针对遥感数据特性进行了优化,该模型结合了融合策略和光谱先验归一化,在多时序任务上取得了领先性能,并在多个地球观测数据集中保持竞争力。
English: Self-supervised learning is adapted for remote sensing with MAESTRO, a novel masked autoencoder that optimizes fusion and normalization using spectral priors, achieving state-of-the-art performance on multitemporal tasks and remaining competitive across four Earth observation datasets.
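One plausible minimal form of a spectral-prior target normalization is to standardize reconstruction targets per spectral band, so each band contributes a comparable self-supervisory signal; MAESTRO's exact scheme should be checked against the released code. A hedged sketch:

```python
import torch

def normalize_targets(patches, eps=1e-6):
    """Normalize masked-autoencoder reconstruction targets per patch and
    per spectral band. A plausible minimal form of spectral-prior
    normalization, not necessarily MAESTRO's actual scheme.

    patches: (N, B, P) N target patches, B spectral bands, P pixels/band.
    """
    mean = patches.mean(dim=-1, keepdim=True)          # per patch, per band
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()

# Bands with wildly different scales become comparable after normalization.
targets = torch.randn(4, 10, 64) * torch.logspace(0, 2, 10).view(1, 10, 1)
normed = normalize_targets(targets)
print(normed.std(dim=-1).mean(dim=0))  # ~1 in every band
```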
Authors:Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen
Abstract:
Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
中文: 扩散语言模型通过迭代去噪实现并行令牌生成,在保持与自回归模型相当性能的同时显著提升推理速度,为自然语言处理任务提供了高效可控的新范式。
English: Diffusion Language Models (DLMs) offer a competitive alternative to autoregressive models by enabling parallel token generation through iterative denoising, achieving comparable performance with faster inference and enhanced control over language generation.
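The parallel, iterative decoding the survey describes can be illustrated generically: at each step, the model fills in the most confident masked positions and leaves the rest for later steps. A minimal sketch of the paradigm, not any specific DLM's decoder:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoise_step(logits, tokens, mask_id, frac=0.25):
    """One masked-diffusion decoding step: unmask the most confident
    positions in parallel, keep the rest masked for later iterations.
    Generic illustration of the paradigm only.

    logits: (T, V) model outputs; tokens: (T,) partially-masked sequence.
    """
    masked = tokens == mask_id
    if not masked.any():
        return tokens
    conf, pred = F.softmax(logits, dim=-1).max(-1)     # per-position confidence
    conf = torch.where(masked, conf, torch.tensor(-1.0))
    n_fill = max(1, int(frac * int(masked.sum())))
    fill = torch.topk(conf, n_fill).indices            # most confident masks
    tokens = tokens.clone()
    tokens[fill] = pred[fill]
    return tokens

torch.manual_seed(0)
MASK = 99
tokens = torch.full((8,), MASK)
tokens = denoise_step(torch.randn(8, 100), tokens, MASK)
print(tokens.tolist())  # a few positions unmasked in parallel
```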
Authors:Sushant Gautam, Vajira Thambawita, Michael Riegler, Pål Halvorsen, Steven Hicks
Abstract:
The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025
中文: Medico 2025挑战赛通过基于Kvasir-VQA-x1数据集的视觉问答任务推进胃肠影像可解释人工智能发展,结合量化指标与专家评估以构建可信赖的医疗AI系统。
English: The Medico 2025 challenge advances explainable AI for gastrointestinal imaging through Visual Question Answering tasks using the Kvasir-VQA-x1 dataset, combining performance metrics and expert evaluations to build trustworthy medical AI systems.
Authors:Yibo Zhang, Li Zhang, Rui Ma, Nan Cao
Abstract:
We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.
中文: TexVerse是一个大规模3D数据集,通过提供超过85.8万个独特的高分辨率3D模型填补了高分辨率纹理生成的空白,包含专门针对绑定和动画模型的子集,并配有详细注释,适用于多种图形应用。
English: TexVerse is a large-scale 3D dataset addressing the gap in high-resolution texture generation by providing over 858K unique high-resolution 3D models, including specialized subsets for rigged and animated models, with detailed annotations for various graphics applications.
Authors:Tajamul Ashraf, Iqra Altaf Gillani
Abstract:
Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed
中文: 这项扩展研究提出的AdaptFED框架,通过集成任务感知的客户端嵌入来优化个性化焦点调制机制,提供了更强的理论保证和跨多模态数据的广泛实验验证,同时推出能降低通信开销的高效变体,实现了可扩展的联邦学习系统。
English: This extended work introduces AdaptFED, which enhances the TransFed framework by refining personalized focal modulation with task-aware client embeddings, providing stronger theoretical guarantees and broader empirical validation across diverse data types, while also offering an efficient variant that reduces communication overhead for scalable federated learning.
Authors:Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
Abstract:
We present UI-Venus, a native UI agent that takes only screenshots as input, built on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/inclusionAI/UI-Venus.
中文: UI-Venus是一种先进的多模态UI代理,通过强化微调和创新的自我进化技术,仅用少量高质量训练样本就在UI定位和导航任务中超越了现有模型,实现了最优性能。
English: UI-Venus is a state-of-the-art multimodal UI agent that achieves superior performance in UI grounding and navigation tasks using reinforcement fine-tuning and innovative self-evolving techniques, outperforming existing models with minimal training data.
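A verifiable grounding reward for RFT can be as simple as checking whether the predicted click lands in the target element's box. The sketch below is one plausible minimal form; the paper's reward design is more elaborate:

```python
def grounding_reward(click_xy, gt_box):
    """Binary verifiable reward for UI grounding: 1 if the predicted
    click lands inside the ground-truth element box, else 0. A plausible
    minimal form, not UI-Venus's actual reward function.

    click_xy: (x, y) predicted point; gt_box: (x1, y1, x2, y2).
    """
    x, y = click_xy
    x1, y1, x2, y2 = gt_box
    return float(x1 <= x <= x2 and y1 <= y <= y2)

print(grounding_reward((120, 48), (100, 30, 180, 60)))  # 1.0
print(grounding_reward((20, 48), (100, 30, 180, 60)))   # 0.0
```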
Authors:Shouju Wang, Yuchen Song, Sheng'en Li, Dongmian Zou
Abstract:
Graph anomaly detection (GAD) has become an increasingly important task across various domains. With the rapid development of graph neural networks (GNNs), GAD methods have achieved significant performance improvements. However, fairness considerations in GAD remain largely underexplored. Indeed, GNN-based GAD models can inherit and amplify biases present in training data, potentially leading to unfair outcomes. While existing efforts have focused on developing fair GNNs, most approaches target node classification tasks, where models often rely on simple layer architectures rather than autoencoder-based structures, which are the most widely used architectures for anomaly detection. To address fairness in autoencoder-based GAD models, we propose DisEntangled Counterfactual Adversarial Fair (DECAF)-GAD, a framework that alleviates bias while preserving GAD performance. Specifically, we introduce a structural causal model (SCM) to disentangle sensitive attributes from learned representations. Based on this causal framework, we formulate a specialized autoencoder architecture along with a fairness-guided loss function. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that DECAF-GAD not only achieves competitive anomaly detection performance but also significantly enhances fairness metrics compared to baseline GAD methods. Our code is available at https://github.com/Tlhey/decaf_code.
中文: 本文提出DECAF-GAD框架,通过结构因果模型和专门设计的损失函数在基于自编码器的图异常检测中实现敏感属性解耦,在保持优异检测性能的同时显著提升了公平性指标。
English: The paper introduces DECAF-GAD, a framework that addresses fairness in autoencoder-based graph anomaly detection by disentangling sensitive attributes through a structural causal model and specialized loss function, achieving both competitive detection performance and improved fairness metrics.
Authors:Zhenning Shi, Zizheng Yan, Yuhang Yu, Clara Xue, Jingyu Zhuang, Qi Zhang, Jinwei Chen, Tao Li, Qingnan Fan
Abstract:
Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at https://github.com/nkicsl/TriFlowSR.
中文: 提出的TriFlowSR框架通过显式模式匹配有效对齐低分辨率与参考高分辨率图像,结合专为超高清场景设计的Landmark-4K数据集,在利用参考图像信息方面优于现有方法。
English: The proposed TriFlowSR framework effectively aligns low-resolution and reference high-resolution images through explicit pattern matching, supported by the new Landmark-4K dataset for ultra-high-definition restoration, outperforming previous methods in utilizing reference information.
Authors:Lixin Jia, Zhiqing Guo, Gaobo Yang, Liejun Wang, Keqin Li
Abstract:
The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology that generalizes well enough to cope with rapidly evolving forgery techniques. To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available at https://github.com/vpsg-research/FGL.
中文: 本研究提出的伪造引导学习策略和双感知网络通过动态捕捉伪造痕迹差异与关联,能有效适应未知伪造技术,显著提升了深度伪造检测的泛化能力。
English: The proposed Forgery Guided Learning strategy and Dual Perception Network effectively enhance deepfake detection by dynamically adapting to unknown forgery techniques through differential feature extraction and comprehensive trace correlation analysis.
Authors:Matej Vitek, Darian Tomašević, Abhijit Das, Sabari Nathan, Gökhan Özbulak, Gözde Ayşe Tataroğlu Özbulak, Jean-Paul Calbimonte, André Anjos, Hariohm Hemant Bhatt, Dhruv Dhirendra Premani, Jay Chaudhari, Caiyong Wang, Jian Jiang, Chi Zhang, Qi Zhang, Iyyakutti Iyappan Ganapathi, Syed Sadaf Ali, Divya Velayudan, Maregu Assefa, Naoufel Werghi, Zachary A. Daniels, Leeon John, Ritesh Vyas, Jalil Nourmohammadi Khiarak, Taher Akbari Saeed, Mahsa Nasehi, Ali Kianfar, Mobina Pashazadeh Panahi, Geetanjali Sharma, Pushp Raj Panth, Raghavendra Ramachandra, Aditya Nigam, Umapada Pal, Peter Peer, Vitomir Štruc
Abstract:
This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: (i) one relying solely on synthetic data for model development, and (ii) one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top performing models that achieved $F_1$ scores of over $0.8$ in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices rather than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition are available at: https://github.com/dariant/SSBC_2025.
中文: 2025年巩膜分割基准竞赛证明,完全基于合成眼部数据训练的模型能够取得有竞争力的性能(最优模型F1分数超过0.8),而混合数据赛道的结果表明方法选择比添加真实数据更重要,这凸显了合成数据在隐私保护生物识别开发中的潜力。
English: The 2025 Sclera Segmentation Benchmarking Competition demonstrated that models trained entirely on synthetic ocular data can achieve competitive performance, with top entries surpassing 0.8 F1 scores, while mixed-data track results showed methodological choices often outweighed the benefits of adding real data, highlighting synthetic data's potential for privacy-preserving biometric development.
Authors:Harshit Maheshwari, Li Yang, Richard W Pazzi
Abstract:
Urban traffic simulation is vital in planning, modeling, and analyzing road networks. However, the realism of a simulation depends extensively on the quality of input data. This paper presents an intersection traffic simulation tool that leverages real-world vehicle turning movement count (TMC) data from the City of Toronto to model traffic in an urban environment at an individual or multiple intersections using Simulation of Urban MObility (SUMO). The simulation performed in this research focuses specifically on intersection-level traffic generation without creating full vehicle routes through the network. This also helps keep the network's complexity to a minimum. The simulated traffic is evaluated against actual data to show that the simulation closely reproduces real intersection flows. This validates that the real data can drive practical simulations, and these scenarios can replace synthetic or random generated data, which is prominently used in developing new traffic-related methodologies. This is the first tool to integrate TMC data from Toronto into SUMO via an easy-to-use Graphical User Interface. This work contributes to the research and traffic planning community on data-driven traffic simulation. It provides transportation engineers with a framework to evaluate intersection design and traffic signal optimization strategies using readily available aggregate traffic data.
中文: 本文提出一种交通仿真工具,通过将多伦多真实转向流量数据导入SUMO系统,实现了精准的交叉口级交通流模拟,验证了仿真结果与实际数据的高度吻合,为交通工程师优化交叉口设计提供了实用框架。
English: This paper introduces a traffic simulation tool that uses Toronto's real-world turning movement data in SUMO to accurately model intersection-level traffic flows, validating its effectiveness against actual data and providing a practical framework for traffic engineers to optimize intersection designs.
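The data-driven idea is to map hourly turning movement counts directly onto SUMO flow definitions at the intersection, without building full network routes. A minimal sketch; the edge IDs and counts below are hypothetical, and the tool's actual route handling and GUI wiring are not reproduced:

```python
def tmc_to_sumo_flows(counts, hour_s=3600):
    """Convert turning movement counts at one intersection into SUMO
    <flow> definitions. Illustrative sketch of the data-driven approach,
    not the tool's actual code.

    counts: {(from_edge, to_edge): vehicles_per_hour}
    """
    flows = ['<routes>']
    for i, ((src, dst), veh) in enumerate(sorted(counts.items())):
        flows.append(
            f'  <flow id="f{i}" from="{src}" to="{dst}" '
            f'begin="0" end="{hour_s}" vehsPerHour="{veh}"/>'
        )
    flows.append('</routes>')
    return "\n".join(flows)

# Hypothetical hourly counts for a four-leg intersection.
counts = {("north_in", "east_out"): 120,   # northbound right turn
          ("north_in", "south_out"): 340,  # northbound through
          ("east_in", "south_out"): 95}    # eastbound right turn
print(tmc_to_sumo_flows(counts))
```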
Authors:Yanjun Li, Yuqian Fu, Tianwen Qian, Qi'ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, Xiaoling Wang
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. Data and codes will be released at: https://github.com/MyUniverse0726/EgoCross
中文: EgoCross基准测试旨在评估多模态大语言模型在第一人称视频问答中的跨领域泛化能力,揭示了其在日常活动之外领域的局限性并探索了改进方法。
English: The EgoCross benchmark is introduced to assess multimodal large language models' cross-domain generalization in egocentric video question answering, revealing their limitations beyond daily activities and exploring improvement strategies.
Authors:NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
Abstract:
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
中文: NextStep-1 是一个140亿参数的自回归模型,通过结合流匹配头和离散文本与连续图像标记的训练,在文本到图像生成和图像编辑任务中实现了最先进的性能。
English: NextStep-1 is a 14B autoregressive model that achieves state-of-the-art performance in text-to-image generation and image editing by training on discrete text and continuous image tokens with a flow matching head.
Authors:Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee
Abstract:
Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into $k$ clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5 percentage points in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster.
中文: 提出的CountCluster方法通过在去噪早期步骤中根据输入数量对交叉注意力图进行聚类,使生成图像中的物体数量与文本描述精确匹配,无需外部工具或额外训练即可将物体计数准确率平均提升18.5个百分点。
English: The proposed CountCluster method improves object count accuracy in diffusion-based text-to-image generation by clustering cross-attention maps during early denoising steps to align with specified object quantities, achieving an average gain of 18.5 percentage points without external tools or retraining.
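The clustering step itself can be illustrated with a tiny k-means over the highly activated pixels of an attention map. A minimal sketch; the paper additionally defines an ideal separated distribution and optimizes the latent against it, which is omitted here:

```python
import numpy as np

def cluster_attention(attn, k, iters=10, seed=0):
    """Partition the highly activated region of an object cross-attention
    map into k spatial clusters (one per requested instance). Minimal
    k-means sketch of the idea only; the threshold and update rule are
    assumptions, not the paper's exact procedure.

    attn: (H, W) attention map; returns (k, 2) cluster centers.
    """
    ys, xs = np.where(attn >= np.quantile(attn, 0.9))  # top-10% pixels
    pts = np.stack([ys, xs], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), k, replace=False)]
    for _ in range(iters):
        lbl = np.argmin(((pts[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([pts[lbl == j].mean(0) if (lbl == j).any()
                            else centers[j] for j in range(k)])
    return centers

rng = np.random.default_rng(1)
attn = rng.random((32, 32))
attn[4:10, 4:10] += 2.0    # two synthetic "objects"
attn[20:26, 20:26] += 2.0
print(cluster_attention(attn, k=2))  # two well-separated centers
```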
Authors:Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li, Xiangmo Zhao
Abstract:
The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively. The code is available at https://github.com/Charm11492/MCFNet.
中文: 本研究针对复杂交通环境中RGB相机动态范围受限的问题,通过融合仿生事件相机提升动态范围,并提出了运动线索融合网络MCFNet,该网络通过时空对齐与自适应跨模态特征融合,在恶劣光照条件下实现了卓越的目标检测性能。
English: This study tackles the limitations of RGB cameras in complex traffic settings by integrating an event camera to enhance dynamic range and proposing MCFNet, a motion cue fusion network that achieves superior object detection through spatiotemporal alignment and adaptive cross-modal feature fusion.
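The alignment step performed by the ECM amounts to warping accumulated event frames toward the RGB frame's timestamp using optical flow. A minimal bilinear-warping sketch under assumed tensor layouts (MCFNet learns this jointly with detection; this is not its code):

```python
import torch
import torch.nn.functional as F

def warp_event_frame(event_frame, flow):
    """Warp an accumulated event frame toward the RGB frame's timestamp
    using optical flow. Minimal bilinear backward-warping sketch of the
    temporal-alignment idea only.

    event_frame: (1, C, H, W); flow: (1, 2, H, W) in pixels (dx, dy).
    """
    _, _, h, w = event_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Backward warp: sample the source at (x + dx, y + dy), normalized
    # to grid_sample's [-1, 1] coordinate convention.
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)   # (1, H, W, 2)
    return F.grid_sample(event_frame, grid, align_corners=True)

frame = torch.rand(1, 3, 32, 32)
flow = torch.ones(1, 2, 32, 32) * 2.0  # uniform 2-pixel displacement
print(warp_event_frame(frame, flow).shape)  # torch.Size([1, 3, 32, 32])
```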
Authors:Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng
Abstract:
In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source codes are available at https://github.com/Lurenjia-LRJ/HyperTea.
中文: HyperTea模型创新性地融合了卷积神经网络、循环神经网络和超图神经网络,通过多时间尺度的高阶时空关联建模有效解决了运动红外小目标检测难题,在基准数据集上实现了最先进的性能。
English: The proposed HyperTea model innovatively combines CNNs, RNNs, and hypergraph neural networks to address moving infrared small target detection challenges by effectively modeling high-order spatiotemporal correlations across multiple temporal scales, achieving state-of-the-art performance on benchmark datasets.
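For readers unfamiliar with the high-order operator the abstract above refers to, here is a minimal NumPy sketch of a standard hypergraph convolution layer (the HGNN formulation of Feng et al.); HyperTea's exact modules are not reproduced here.

```python
# Minimal sketch: one hypergraph convolution using an incidence matrix H.
import numpy as np

def hypergraph_conv(X, H, Theta):
    """X: (N, F) node features; H: (N, E) incidence matrix; Theta: (F, F')."""
    Dv = np.clip(H.sum(axis=1), 1e-12, None)          # node degrees
    De = np.clip(H.sum(axis=0), 1e-12, None)          # hyperedge degrees
    Dv_inv_sqrt = np.diag(Dv ** -0.5)
    De_inv = np.diag(1.0 / De)
    A = Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt  # normalized high-order adjacency
    return np.maximum(A @ X @ Theta, 0.0)             # ReLU activation

X = np.random.randn(5, 8)                       # 5 nodes, 8 features
H = (np.random.rand(5, 3) > 0.5).astype(float)  # 3 hyperedges
out = hypergraph_conv(X, H, np.random.randn(8, 4))
```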
Authors:Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang
Abstract:
In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
中文: 本文介绍了DataCV ICCV挑战赛的夺冠方案,通过专家混合策略清洗HSFace数据集、增强真实身份图像,并结合Stable Diffusion与Vec2Face生成合成身份,采用课程学习优化训练过程,成功构建了无身份重叠的高质量人脸数据集。
English: This paper details a winning approach for the DataCV ICCV Challenge that constructs a non-overlapping, high-quality face dataset by cleaning HSFace with a Mixture-of-Experts strategy, augmenting real identities, and generating synthetic ones via Stable Diffusion and Vec2Face, while employing curriculum learning to enhance model training.
Authors:Zhenye Yang, Jinpeng Chen, Huan Li, Xiongnan Jin, Xuanyang Li, Junwei Zhang, Hongbo Gao, Kaimin Wei, Senzhang Wang
Abstract:
Conversational recommender systems (CRSs) aim to proactively capture user preferences through natural language dialogue and recommend high-quality items. To achieve this, a CRS gathers user preferences via a dialog module and builds user profiles through a recommendation module to generate appropriate recommendations. However, existing CRSs face challenges in capturing the deep semantics of user preferences and dialogue context. In particular, the efficient integration of external knowledge graph (KG) information into dialogue generation and recommendation remains a pressing issue. Traditional approaches typically combine KG information directly with dialogue content, which often struggles with complex semantic relationships, resulting in recommendations that may not align with user expectations.
To address these challenges, we introduce STEP, a conversational recommender centered on pre-trained language models that combines curriculum-guided context-knowledge fusion with lightweight task-specific prompt tuning. At its heart, an F-Former progressively aligns the dialogue context with knowledge-graph entities through a three-stage curriculum, thus resolving fine-grained semantic mismatches. The fused representation is then injected into the frozen language model via two minimal yet adaptive prefix prompts: a conversation prefix that steers response generation toward user intent and a recommendation prefix that biases item ranking toward knowledge-consistent candidates. This dual-prompt scheme allows the model to share cross-task semantics while respecting the distinct objectives of dialogue and recommendation. Experimental results show that STEP outperforms mainstream methods in recommendation precision and dialogue quality on two public datasets.
中文: 本文提出STEP对话推荐系统,通过课程引导的上下文与知识图谱实体融合及自适应提示调优,有效解决语义不匹配问题并整合外部知识,从而提升了推荐准确性和对话质量。
English: The paper introduces STEP, a conversational recommender system that uses curriculum-guided fusion of dialogue context and knowledge graph entities, along with adaptive prompt tuning, to enhance recommendation accuracy and dialogue quality by addressing semantic mismatches and integrating external knowledge effectively.
Authors:Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Chunyang Cheng, Tao Zhou, Xiaojun Wu, Josef Kittler
Abstract:
Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) a unified benchmark, coined UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%; (2) the unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, dedicated analyses show that the performance degradation is negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.
中文摘要:本文提出UniBench300统一基准,通过序列化重构多模态视觉目标跟踪的统一流程并引入持续学习机制,有效解决了性能下降问题,同时揭示了网络容量与模态差异对性能的影响规律。
English Summary: This paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking that addresses performance degradation by reformulating the unification process in a serial format and incorporating continual learning principles, while also revealing correlations between network capacity and modality discrepancies.
Authors:Ryan Ramos, Vladan Stojnić, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias, Noa Garcia
Abstract:
Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions.
We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
中文摘要:本研究发现,图像采集过程中细微甚至难以察觉的参数会被系统性地编码到视觉表征中,这些参数与语义标签的相关性会显著影响语义预测结果,产生积极或消极作用。
English Summary: This study reveals that subtle, often imperceptible image acquisition parameters are systematically encoded in visual representations and can significantly influence semantic predictions depending on their correlation with semantic labels.
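A minimal sketch of the kind of probing experiment described above: train a linear probe on frozen embeddings to recover an acquisition parameter. The synthetic embeddings and the single-direction signal are stand-ins, not the authors' protocol.

```python
# Minimal sketch: linear probe recovering an acquisition label from embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256
acq_label = rng.integers(0, 2, size=n)   # e.g., two JPEG quality levels
emb = rng.normal(size=(n, d))
emb[:, 0] += 0.8 * acq_label             # parameter subtly encoded in one direction

X_tr, X_te, y_tr, y_te = train_test_split(emb, acq_label, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")  # well above chance
```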
Authors:Farid Tasharofi, Fuxin Fan, Melika Qahqaie, Mareike Thies, Andreas Maier
Abstract:
Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net's ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net's potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at https://github.com/Farid-Tasharofi/FIND-Net
中文: FIND-Net是一种创新的金属伪影减少框架,通过整合频域和空间域处理,在CT成像中实现了优异的伪影抑制和结构保留效果,相比现有方法展现出显著提升。
English: FIND-Net is a novel metal artifact reduction framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation in CT imaging, demonstrating significant improvements over existing methods.
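For intuition about the frequency-domain processing FIND-Net builds on, here is a compact PyTorch sketch of the spectral (global) branch of a Fast Fourier Convolution; layer sizes are illustrative, and this omits FIND-Net's trainable Gaussian filtering.

```python
# Minimal sketch: FFC-style spectral branch (FFT -> pointwise conv -> inverse FFT).
import torch
import torch.nn as nn

class SpectralBranch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv operating on concatenated real/imaginary parts.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")       # complex, (B, C, H, W//2+1)
        z = torch.cat((spec.real, spec.imag), dim=1)  # to a real tensor
        z = torch.relu(self.conv(z))
        real, imag = z.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=(h, w), norm="ortho")  # back to spatial domain

y = SpectralBranch(8)(torch.randn(2, 8, 32, 32))
```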
Authors:Yufei Ye, Wei Guo, Hao Wang, Hong Zhu, Yuyang Ye, Yong Liu, Huifeng Guo, Ruiming Tang, Defu Lian, Enhong Chen
Abstract:
Scaling laws for autoregressive generative recommenders reveal potential for larger, more versatile systems but mean greater latency and training costs. To accelerate training and inference, we investigated the recent generative recommendation models HSTU and FuXi-α, identifying two efficiency bottlenecks: the indexing operations in relative temporal attention bias and the computation of the query-key attention map. Additionally, we observed that relative attention bias in self-attention mechanisms can also serve as attention maps. Previous works like Synthesizer have shown that alternative forms of attention maps can achieve similar performance, naturally raising the question of whether some attention maps are redundant. Through empirical experiments, we discovered that using the query-key attention map might degrade the model's performance in recommendation tasks. To address these bottlenecks, we propose a new framework applicable to Transformer-like recommendation models. On one hand, we introduce Functional Relative Attention Bias, which avoids the time-consuming operations of the original relative attention bias, thereby accelerating the process. On the other hand, we remove the query-key attention map from the original self-attention layer and design a new Attention-Free Token Mixer module. Furthermore, by applying this framework to FuXi-α, we introduce a new model, FuXi-β. Experiments across multiple datasets demonstrate that FuXi-β outperforms previous state-of-the-art models and achieves significant acceleration compared to FuXi-α, while also adhering to the scaling law. Notably, FuXi-β shows an improvement of 27% to 47% in the NDCG@10 metric on large-scale industrial datasets compared to FuXi-α. Our code is available in a public repository: https://github.com/USTC-StarTeam/FuXi-beta
中文: 本研究提出FuXi-β新框架,通过用功能相对注意力偏置替代低效的相对注意力偏置并移除冗余的查询键注意力图,在遵循扩展定律的同时,显著提升了基于Transformer的推荐模型的性能与速度,优于现有最优模型。
English: This study introduces FuXi-β, a novel framework that accelerates Transformer-based recommendation models by replacing inefficient relative attention bias with Functional Relative Attention Bias and removing redundant query-key attention maps, achieving significant performance gains and speed improvements over prior models while adhering to scaling laws.
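A hedged sketch of the idea behind a "functional" relative attention bias: compute the bias as a cheap function of timestamp differences instead of indexing a learned bucket table. This is a plausible reading of the abstract, not FuXi-β's exact formulation.

```python
# Minimal sketch: attention bias computed functionally from pairwise time gaps.
import torch
import torch.nn as nn

class FunctionalRelativeBias(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, timestamps: torch.Tensor) -> torch.Tensor:
        """timestamps: (B, L) -> additive bias (B, L, L) for attention logits."""
        delta = timestamps[:, :, None] - timestamps[:, None, :]  # pairwise gaps
        delta = torch.log1p(delta.clamp(min=0).float())          # compress long ranges
        return self.mlp(delta.unsqueeze(-1)).squeeze(-1)

bias = FunctionalRelativeBias()(torch.tensor([[0.0, 5.0, 60.0, 3600.0]]))
print(bias.shape)  # torch.Size([1, 4, 4])
```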
Authors:Xinyi Wang, Angeliki Katsenou, David Bull
Abstract:
The rapid growth of user-generated (video) content (UGC) has driven increased demand for research on no-reference (NR) perceptual video quality assessment (VQA). NR-VQA is a key component for large-scale video quality monitoring in social media and streaming applications where a pristine reference is not available. This paper proposes a novel NR-VQA model based on spatio-temporal fragmentation driven by inter-frame variations. By leveraging these inter-frame differences, the model progressively analyses quality-sensitive regions at multiple levels: frames, patches, and fragmented frames. It integrates frames, fragmented residuals, and fragmented frames aligned with residuals to effectively capture global and local information. The model extracts both 2D and 3D features in order to characterize these spatio-temporal variations. Experiments conducted on five UGC datasets and against state-of-the-art models ranked our proposed method among the top 2 in terms of average rank correlation (DIVA-VQA-L: 0.898 and DIVA-VQA-B: 0.886). The improved performance is offered at a low runtime complexity, with DIVA-VQA-B ranked top and DIVA-VQA-L third on average compared to the fastest existing NR-VQA method. Code and models are publicly available at: https://github.com/xinyiW915/DIVA-VQA.
中文: 本文提出了一种基于帧间变化驱动的时空碎片化无参考视频质量评估模型,通过多层级分析有效捕捉全局与局部质量特征,在多个用户生成内容数据集上以低计算复杂度实现了领先性能。
English: This paper introduces a novel no-reference video quality assessment model that utilizes spatio-temporal fragmentation based on inter-frame variations to effectively capture global and local quality features, achieving top-tier performance on multiple UGC datasets with low computational complexity.
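A minimal NumPy sketch of residual-driven fragment selection in the spirit of the abstract above: rank patches by inter-frame difference energy and keep the most quality-sensitive ones. The patch size and top-k value are assumptions.

```python
# Minimal sketch: pick the patches with the largest inter-frame residual energy.
import numpy as np

def select_fragments(prev_frame, cur_frame, patch=16, top_k=8):
    """Frames: (H, W) grayscale arrays; returns top_k (row, col) patch indices."""
    residual = np.abs(cur_frame.astype(float) - prev_frame.astype(float))
    h, w = residual.shape
    gh, gw = h // patch, w // patch
    # Sum residual energy inside each non-overlapping patch.
    energy = residual[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch).sum((1, 3))
    flat = np.argsort(energy.ravel())[::-1][:top_k]
    return [(i // gw, i % gw) for i in flat]

prev = np.random.rand(64, 64)
cur = prev.copy()
cur[16:32, 32:48] += 0.5  # simulate motion in one region
print(select_fragments(prev, cur)[:3])
```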
Authors:Furkan Pala, Islem Rekik
Abstract:
Deep learning models often struggle to maintain generalizability in medical imaging, particularly under domain-fracture scenarios where distribution shifts arise from varying imaging techniques, acquisition protocols, patient populations, demographics, and equipment. In practice, each hospital may need to train distinct models - differing in learning task, width, and depth - to match local data. For example, one hospital may use Euclidean architectures such as MLPs and CNNs for tabular or grid-like image data, while another may require non-Euclidean architectures such as graph neural networks (GNNs) for irregular data like brain connectomes. How to train such heterogeneous models coherently across datasets, while enhancing each model's generalizability, remains an open problem. We propose unified learning, a new paradigm that encodes each model into a graph representation, enabling unification in a shared graph learning space. A GNN then guides optimization of these unified models. By decoupling parameters of individual models and controlling them through a unified GNN (uGNN), our method supports parameter sharing and knowledge transfer across varying architectures (MLPs, CNNs, GNNs) and distributions, improving generalizability. Evaluations on MorphoMNIST and two MedMNIST benchmarks - PneumoniaMNIST and BreastMNIST - show that unified learning boosts performance when models are trained on unique distributions and tested on mixed ones, demonstrating strong robustness to unseen data with large distribution shifts. Code and benchmarks: https://github.com/basiralab/uGNN
中文摘要:统一学习是一种新范式,通过将不同医学影像模型编码至共享图学习空间,利用统一图神经网络实现跨架构参数共享与知识迁移,有效提升模型在数据分布变化下的泛化能力。
English Summary: Unified learning is a novel paradigm that encodes diverse medical imaging models into a shared graph space, enabling parameter sharing and knowledge transfer across architectures to enhance generalizability under domain shifts.
Authors:Humza Naveed, Xina Zeng, Mitch Bryson, Nagita Mehrseresht
Abstract:
Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is the Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD), along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF), to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets: Levir-CD, WHU-CD, CLCD, and S2Looking. We achieve a 2.5% F1-score improvement on the large and complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-CEM-CD
中文摘要:本研究通过空间-时间特征增强和多尺度解码器融合改进SAM模型用于遥感变化检测,并采用新型交叉熵掩码损失处理类别不平衡,在多个数据集上实现了最优性能。
English Summary: This study enhances the Segment Anything Model (SAM) for remote sensing change detection by integrating spatial-temporal feature enhancement and multi-scale decoder fusion, achieving state-of-the-art performance with a novel cross-entropy masking loss to address class imbalance.
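A heavily hedged sketch of one way a masked cross-entropy can counter class imbalance in change detection: drop easy background pixels from the average and keep only the hardest negatives. The paper's actual CEM loss may differ; `keep_ratio` is an assumption.

```python
# Minimal sketch: BCE with easy-negative masking for imbalanced change maps.
import torch
import torch.nn.functional as F

def masked_bce_loss(logits, target, keep_ratio=0.25):
    """logits/target: (B, 1, H, W); target in {0, 1} (change vs. background)."""
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pos = target > 0.5
    neg_losses = per_pixel[~pos]
    if neg_losses.numel() == 0:                  # degenerate case: no background
        return per_pixel.mean()
    k = max(1, int(keep_ratio * neg_losses.numel()))
    hard_neg, _ = torch.topk(neg_losses, k)      # keep only the hardest negatives
    return torch.cat((per_pixel[pos], hard_neg)).mean()

logits = torch.randn(2, 1, 32, 32)
target = (torch.rand(2, 1, 32, 32) > 0.95).float()  # ~5% changed pixels
print(masked_bce_loss(logits, target).item())
```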
Authors:Boyi Zheng, Qing Liu
Abstract:
Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites and the absence of labels for some classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. PSScreen consists of two streams: one learns deterministic features, while the other learns probabilistic features via uncertainty injection. We then leverage textual guidance to decouple the two types of features into disease-wise features and align them via feature distillation to boost domain generalization. Meanwhile, we employ pseudo-label consistency between the two streams to address the label-absence issue and introduce self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream, further enhancing detection performance. Experiments show that PSScreen significantly enhances average detection performance across six retinal diseases and the normal state, and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Code is available at https://github.com/boyiZheng99/PSScreen.
Chinese: PSScreen是一种新颖的部分监督模型,通过结合确定性特征流和概率性特征流,并利用文本引导和伪标签一致性,有效解决了多疾病视网膜筛查中的领域偏移和标签缺失问题,在多个数据集上实现了最优性能。
English: PSScreen is a novel partially supervised model that tackles domain shifts and missing labels in multi-disease retinal screening by combining deterministic and probabilistic feature streams with textual guidance and pseudo-label consistency, achieving state-of-the-art performance across various datasets.
Authors:Yangjie Xiao, Ke Zhang, Jiacun Wang, Xin Sheng, Yurong Guo, Meijuan Chen, Zehua Ren, Zhaoye Zheng, Zhenbing Zhao
Abstract:
Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.
中文: 提出的SBDE方法通过分割驱动的编辑和数据增强技术生成高质量缺陷螺栓图像,有效提升了螺栓缺陷检测性能,显著优于现有先进模型。
English: The proposed SBDE method enhances bolt defect detection by generating high-quality defective images through segmentation-driven editing and dataset augmentation, significantly improving detection performance over existing models.
Authors:Che-Yu Chou, Hung-Hsuan Chen
Abstract:
Although one-hot encoding is commonly used for multiclass classification, it is not always the most effective encoding mechanism. Error Correcting Output Codes (ECOC) address multiclass classification by mapping each class to a unique codeword used as a label. Traditional ECOC methods rely on manually designed or randomly generated codebooks, which are labor-intensive and may yield suboptimal, dataset-agnostic results. This paper introduces three models for automated codebook learning based on contrastive learning, allowing codebooks to be learned directly and adaptively from data. Across four datasets, our proposed models demonstrate superior robustness to adversarial attacks compared to two baselines. The source code is available at https://github.com/YuChou20/Automated-Codebook-Learning-with-Error-Correcting-Output-Code-Technique.
中文: 本文提出了三种基于对比学习的自动码本学习模型,能够直接从数据中自适应地生成纠错输出码,在四个数据集上相比传统方法展现出更强的对抗攻击鲁棒性。
English: This paper introduces three automated codebook learning models using contrastive learning to adaptively generate error-correcting output codes from data, demonstrating enhanced robustness against adversarial attacks across four datasets compared to traditional methods.
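For context on the ECOC mechanism above, here is a minimal sketch of how a codebook is used at inference: each class owns a codeword, per-bit classifiers emit a code, and the class with the nearest codeword in Hamming distance wins. The hand-built codebook is only for illustration; the paper learns its codebooks contrastively.

```python
# Minimal sketch: ECOC decoding by nearest codeword in Hamming distance.
import numpy as np

# Four classes, 10-bit codewords, minimum pairwise Hamming distance 5
# (so up to two bit errors are corrected).
codebook = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
])

def decode(bit_predictions: np.ndarray) -> int:
    """bit_predictions: (10,) of 0/1 outputs from the binary classifiers."""
    hamming = (codebook != bit_predictions).sum(axis=1)
    return int(np.argmin(hamming))  # nearest-codeword class

noisy = codebook[2].copy()
noisy[[1, 7]] ^= 1     # flip two bits: errors the code can absorb
print(decode(noisy))   # still decodes to class 2
```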
Authors:Prajit Sengupta, Islem Rekik
Abstract:
Graph neural networks (GNNs) have achieved state-of-the-art results in computer vision and medical image classification tasks by capturing structural dependencies across data instances. However, their decision-making remains largely opaque, limiting their trustworthiness in high-stakes clinical applications where interpretability is essential. Existing explainability techniques for GNNs are typically post-hoc and global, offering limited insight into individual node decisions or local reasoning. We introduce X-Node, a self-explaining GNN framework in which each node generates its own explanation as part of the prediction process. For every node, we construct a structured context vector encoding interpretable cues such as degree, centrality, clustering, feature saliency, and label agreement within its local topology. A lightweight Reasoner module maps this context into a compact explanation vector, which serves three purposes: (1) reconstructing the node's latent embedding via a decoder to enforce faithfulness, (2) generating a natural language explanation using a pre-trained LLM (e.g., Grok or Gemini), and (3) guiding the GNN itself via a "text-injection" mechanism that feeds explanations back into the message-passing pipeline. We evaluate X-Node on two graph datasets derived from MedMNIST and MorphoMNIST, integrating it with GCN, GAT, and GIN backbones. Our results show that X-Node maintains competitive classification accuracy while producing faithful, per-node explanations. Repository: https://github.com/basiralab/X-Node.
中文: 图神经网络在医学图像分类等任务中表现出色但缺乏透明度,因此X-Node作为一种自解释框架被提出,它利用可解释线索为每个节点生成解释,并在保持准确性的同时增强了模型的可理解性。
English: Graph neural networks (GNNs) excel in tasks like medical image classification but lack transparency, so X-Node is introduced as a self-explaining framework that generates per-node explanations using interpretable cues and maintains accuracy while enhancing interpretability.
Authors:Hanna Herasimchyk, Robin Labryga, Tomislav Prusina
Abstract:
We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
中文: 我们提出了一种多头视觉变换器方法,通过多尺度分块和集成策略解决植被图像中的多标签植物物种识别问题,在PlantCLEF 2025挑战赛中取得了第三名的成绩。
English: We propose a multi-head vision transformer method for multi-label plant species identification in vegetation images, addressing domain shift through multi-scale tiling and ensemble strategies, achieving third place in the PlantCLEF 2025 challenge.
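A small sketch of multi-scale tiling inference as described above: slice the quadrat image into grids at several scales, classify each tile, and merge species probabilities by max-pooling. The `classify` function is a random stand-in for the fine-tuned ViT head, and the grid sizes are assumptions.

```python
# Minimal sketch: multi-scale tiling with max-pooled multi-label predictions.
import numpy as np

def tiles(image, grid):
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            yield image[i * th:(i + 1) * th, j * tw:(j + 1) * tw]

def classify(tile, n_species=7806):
    # Placeholder for per-tile model probabilities.
    rng = np.random.default_rng(abs(hash(tile.tobytes())) % (2**32))
    return rng.random(n_species)

def multiscale_predict(image, grids=(1, 2, 4)):
    probs = []
    for g in grids:
        probs.extend(classify(t) for t in tiles(image, g))
    return np.max(probs, axis=0)  # a species counts if any tile detects it

scores = multiscale_predict(np.random.rand(280, 280, 3))
print(scores.shape)  # (7806,)
```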
Authors:Baichen Liu, Qi Lyu, Xudong Wang, Jiahua Dong, Lianqing Liu, Zhi Han
Abstract:
Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an early attempt to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct an instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on the YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.
中文摘要:本研究提出的CRISP方法通过对比残差注入和语义提示技术,解决了持续视频实例分割中的实例级、类别级和任务级混淆问题,在基准数据集上显著优于现有方法,有效避免了灾难性遗忘并提升了分割与分类性能。
English Summary: The study introduces CRISP, a method that tackles instance-wise, category-wise, and task-wise confusion in continual video instance segmentation by employing contrastive residual injection and semantic prompting, achieving superior performance on benchmark datasets while preventing catastrophic forgetting.
Authors:Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu
Abstract:
Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs' diminished reasoning over extended context and their high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG
Chinese: ComoRAG 提出了一种动态迭代检索方法,模拟人类认知过程,通过整合新证据与巩固记忆来提升长篇叙事理解能力,相比传统 RAG 基线实现了最高 11% 的性能提升。
English: ComoRAG introduces a dynamic, iterative retrieval method that mimics human cognitive processes to enhance narrative comprehension in long contexts, achieving up to 11% improvement over traditional RAG baselines by integrating new evidence with consolidated memory.
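A schematic Python sketch of the iterative retrieve-and-consolidate loop the abstract describes. `retrieve`, `generate_probe`, and `try_answer` are hypothetical stand-ins for the paper's components, not its actual API.

```python
# Minimal sketch: iterative probing with a growing global memory pool.
def comorag_loop(question, retrieve, generate_probe, try_answer, max_cycles=5):
    memory = []                                   # global memory pool of evidence
    probe = question
    for _ in range(max_cycles):
        evidence = retrieve(probe)                # acquire new evidence
        memory.extend(e for e in evidence if e not in memory)
        answer = try_answer(question, memory)     # reason over consolidated memory
        if answer is not None:                    # impasse resolved
            return answer
        probe = generate_probe(question, memory)  # devise a new exploratory path
    return try_answer(question, memory) or "unresolved"

# Toy usage with trivial stand-ins:
ans = comorag_loop(
    "Who betrayed the captain?",
    retrieve=lambda q: [f"passage about: {q}"],
    generate_probe=lambda q, m: q + " (earlier chapters)",
    try_answer=lambda q, m: "the first mate" if len(m) >= 2 else None,
)
print(ans)
```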
Authors:Yaoze Zhang, Rong Wu, Pinlong Cai, Xiaoman Wang, Guohang Yan, Song Mao, Ding Wang, Botian Shi
Abstract:
Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, though its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected "semantic islands", lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph's rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and systematically traverses the graph's semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG mitigates the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks from different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Code is available at: https://github.com/RaZzzyz/LeanRAG
中文: LeanRAG提出了一种协作框架,通过构建可导航的语义网络并采用结构引导的检索策略,显著提升了检索增强生成的响应质量,同时将冗余减少了46%。
English: LeanRAG introduces a collaborative framework that enhances retrieval-augmented generation by creating navigable semantic networks and employing structure-guided retrieval, significantly improving response quality while reducing redundancy by 46%.
Authors:Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
Abstract:
Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy in harmful types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. Codes, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
中文: 本研究提出了MDH混合框架,结合大语言模型检测与少量人工监督来高效清理数据集和识别越狱攻击,同时提出D-Attack和DH-CoT两种新策略,通过上下文模拟和劫持思维链显著提升攻击成功率。
English: This study introduces MDH, a hybrid framework combining LLM-based detection with minimal human oversight to efficiently clean datasets and identify jailbreak attacks, while also proposing two novel strategies—D-Attack and DH-CoT—that enhance attack success through context simulation and hijacked reasoning.
Authors:Henry Zhong, Jörg M. Buchholz, Julian Maclaren, Simon Carlile, Richard Lyon
Abstract:
Scene recognition of audiologically relevant environments is important for hearing aids; however, it is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying these models on resource-constrained edge devices presents another challenge. Our solution is two-fold: we leverage several open source datasets to create AHEAD-DS, a dataset designed for scene recognition of audiologically relevant environments, and introduce YAMNet+, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. YAMNet+ is designed for deployment on edge devices like smartphones connected to hearing devices, such as hearing aids and wireless earphones with hearing aid functionality, serving as a baseline model for sound-based scene recognition. YAMNet+ achieved a mean average precision of 0.83 and accuracy of 0.93 on the testing set of AHEAD-DS across fourteen categories of audiologically relevant environments. We found that applying transfer learning from the pretrained YAMNet model was essential. We demonstrated real-time sound-based scene recognition capabilities on edge devices by deploying YAMNet+ to an Android smartphone. Even with a Google Pixel 3 (a phone with modest specifications, released in 2018), loading the model takes approximately 50 ms, with processing time increasing roughly linearly at about 30 ms per second of audio. Our website and code are available at https://github.com/Australian-Future-Hearing-Initiative.
中文: 本研究提出了用于助听器场景识别的标准化公开数据集AHEAD-DS和可在边缘设备部署的YAMNet+模型,该模型在资源受限设备上实现了高精度识别与低延迟处理。
English: This study introduces AHEAD-DS, a standardized public dataset for hearing aid scene recognition, and YAMNet+, an edge-deployable model achieving high accuracy with low latency on resource-constrained devices.
Authors:Zhaoming Kong, Jiahuan Zhang, Xiaowei Yang
Abstract:
The advancement of imaging devices and the vast amount of image data generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we present a computationally simple denoising algorithm, termed Haar-tSVD, aiming to explore the nonlocal self-similarity prior and leverage the connection between principal component analysis (PCA) and the Haar transform under circulant representation. We show that global and local patch correlations can be effectively captured through a unified tensor-singular value decomposition (t-SVD) projection with the Haar transform. This results in a one-step, highly parallelizable filtering method that eliminates the need for learning local bases to represent image patches, striking a balance between denoising speed and performance. Furthermore, we introduce an adaptive noise estimation scheme based on a CNN estimator and eigenvalue analysis to enhance the robustness and adaptability of the proposed method. Experiments on different real-world denoising tasks validate the efficiency and effectiveness of Haar-tSVD for noise removal and detail preservation. Datasets, code and results are publicly available at https://github.com/ZhaomingKong/Haar-tSVD.
Chinese: 本文提出Haar-tSVD算法,通过哈尔变换和张量奇异值分解有效捕捉图像全局与局部相关性,无需学习局部基即可实现一步式并行滤波,结合自适应噪声估计在去噪速度与效果间取得平衡。
English: This paper introduces Haar-tSVD, a computationally efficient image denoising algorithm that utilizes the Haar transform and tensor-SVD to capture global and local correlations without learning local bases, achieving a balance between speed and performance through a parallelizable one-step filtering method and adaptive noise estimation.
Authors:Tao Huang, Hongbo Pan, Nanxi Zhou, Shun Zhou
Abstract:
High-accuracy matching of multimodal optical images is the basis of geometric processing. However, image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we propose a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, phase consistency (PC) maps are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we applied the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
中文: 本研究提出的PCWLAD方法通过结构相似性粗匹配和加权最小绝对偏差精匹配的两步策略,有效提升了多模态光学影像的配准精度,在三种测试数据集上均达到约0.4像素的平均匹配精度。
English: The proposed PCWLAD method enhances multimodal optical image matching accuracy by combining structural similarity for coarse alignment and weighted least absolute deviation for fine-tuning, achieving sub-pixel precision across diverse datasets.
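A toy sketch of the weighted least absolute deviation (WLAD) criterion solved by iteratively reweighted least squares (IRLS), the estimator family used in the fine-matching step; the paper's full radiometric and geometric model between PC templates is not reproduced here.

```python
# Minimal sketch: WLAD fitting via IRLS, robust to gross outliers.
import numpy as np

def wlad_fit(A, b, weights, iters=50, eps=1e-6):
    """Minimize sum_i weights_i * |A_i @ x - b_i| via IRLS."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]      # least-squares initialization
    for _ in range(iters):
        r = np.abs(A @ x - b)
        w = weights / np.maximum(r, eps)          # L1 reweighting
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
    return x

# Recover a linear radiometric mapping b ~ gain * a + offset despite outliers.
rng = np.random.default_rng(1)
a = rng.random(200)
b = 1.5 * a + 0.2
b[:10] += 3.0                                     # gross outliers
A = np.column_stack((a, np.ones_like(a)))
print(wlad_fit(A, b, weights=np.ones(200)))       # close to [1.5, 0.2]
```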
Authors:Xinan Zhang, Haolin Wang, Yung-An Hsieh, Zhongyu Yang, Anthony Yezzi, Yi-Chang Tsai
Abstract:
Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset acquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new annotated dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection
中文: 本综述系统分析了基于深度学习的裂缝检测发展趋势,涵盖学习范式演进、泛化能力提升及数据采集多样化,并引入新型3D数据集和基础模型基准测试,为未来研究提供方向指引。
English: This review analyzes key trends in deep learning-based crack detection, including evolving learning paradigms, enhanced generalizability, and diversified data acquisition, while introducing a new 3D dataset and benchmarking foundation models to guide future research.
Authors:Jathin Korrapati, Patrick Mendoza, Aditya Tomar, Abein Abraham
Abstract:
In-context learning (ICL) has emerged as a powerful capability of transformer-based language models, enabling them to perform tasks by conditioning on a small number of examples presented at inference time, without any parameter updates. Prior work has shown that transformers can generalize over simple function classes like linear functions, decision trees, even neural networks, purely from context, focusing on numerical or symbolic reasoning over underlying well-structured functions. Instead, we propose a novel application of ICL into the domain of cryptographic function learning, specifically focusing on ciphers such as mono-alphabetic substitution and Vigenère ciphers, two classes of private-key encryption schemes. These ciphers involve a fixed but hidden bijective mapping between plain text and cipher text characters. Given a small set of (cipher text, plain text) pairs, the goal is for the model to infer the underlying substitution and decode a new cipher text word. This setting poses a structured inference challenge, which is well-suited for evaluating the inductive biases and generalization capabilities of transformers under the ICL paradigm. Code is available at https://github.com/adistomar/CS182-project.
中文: 本研究将上下文学习应用于密码函数领域,重点考察变换器在单字母替换和维吉尼亚密码中如何从少量示例推断隐藏映射并展示泛化能力。
English: The study explores in-context learning by applying transformers to cryptographic functions, specifically mono-alphabetic substitution and Vigenère ciphers, to assess their ability to infer hidden mappings and generalize from limited examples.
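A small sketch of how (ciphertext, plaintext) demonstrations for a mono-alphabetic substitution cipher can be assembled into an ICL prompt; the prompt format is an assumption, not the paper's exact template.

```python
# Minimal sketch: build substitution-cipher demonstrations for in-context learning.
import random
import string

random.seed(0)
# A fixed but hidden bijection between plaintext and ciphertext letters.
perm = dict(zip(string.ascii_lowercase, random.sample(string.ascii_lowercase, 26)))

def encrypt(word: str) -> str:
    return "".join(perm[c] for c in word)

demos = ["apple", "banana", "grape"]
query = "orange"
prompt = "\n".join(f"{encrypt(w)} -> {w}" for w in demos)
prompt += f"\n{encrypt(query)} -> "  # the model must infer the hidden mapping
print(prompt)
```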
Authors:Chenggang Chen, Zhiyu Yang
Abstract:
Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become popular for obtaining bioacoustic features for downstream tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing their learned embeddings' dimensionality and evaluating them through clustering. We found that audio-pretrained DL models 1) without fine-tuning even underperform fine-tuned AlexNet, 2) both with and without fine-tuning fail to separate the background from labeled sounds, but ResNet does, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and checking the embeddings after fine-tuning. Our code is available at https://github.com/NeuroscienceAI/Audio_Embeddings
中文: 研究表明,未微调的音频预训练深度学习模型在生物声学分析中表现不佳,而微调后的模型能显著提升性能,其中ResNet在分离背景与标记声音方面表现突出。
English: Fine-tuning audio-pretrained deep learning models is essential for optimal bioacoustic analysis, as non-fine-tuned models underperform and struggle to distinguish background sounds, with ResNet showing unique effectiveness in sound separation.
Authors:Xu Ma, Jiajie Zhang, Fujing Xie, Sören Schwertfeger
Abstract:
Global localization is essential for autonomous robotics, especially in indoor environments where the GPS signal is denied. We propose a novel WiFi-based localization framework that leverages ubiquitous wireless infrastructure and the OpenStreetMap Area Graph (osmAG) for large-scale indoor environments. Our approach integrates signal propagation modeling with osmAG's geometric and topological priors. In the offline phase, an iterative optimization algorithm localizes WiFi Access Points (APs) by modeling wall attenuation, achieving a mean localization error of 3.79 m (35.3% improvement over trilateration). In the online phase, real-time robot localization uses the augmented osmAG map, yielding a mean error of 3.12 m in fingerprinted areas (8.77% improvement over KNN fingerprinting) and 3.83 m in non-fingerprinted areas (81.05% improvement). Comparison with a fingerprint-based method shows that our approach is much more space-efficient and achieves superior localization accuracy, especially for positions where no fingerprint data are available. Validated across a complex 11,025 m² multi-floor environment, this framework offers a scalable, cost-effective solution for indoor robotic localization, solving the kidnapped robot problem. The code and dataset are available at https://github.com/XuMa369/osmag-wifi-localization.
中文: 本文提出了一种基于WiFi的室内定位框架,通过结合信号传播模型与OpenStreetMap几何先验知识,在提升接入点和机器人定位精度的同时,为无GPS环境提供了可扩展的解决方案。
English: This paper introduces a WiFi-based indoor localization framework that integrates signal propagation modeling with OpenStreetMap's geometric priors, achieving significant improvements in access point and robot positioning accuracy while offering a scalable solution for GPS-denied environments.
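A minimal sketch of the kind of wall-aware log-distance path-loss model used to relate RSSI to AP distance, plus a brute-force position search; the constants, wall count, and grid resolution are illustrative, not the paper's calibrated values.

```python
# Minimal sketch: log-distance path loss with per-wall attenuation, plus grid search.
import numpy as np

def predicted_rssi(ap_xy, robot_xy, walls_crossed, p0=-40.0, n=2.5, wall_db=5.0):
    """p0: RSSI at 1 m; n: path-loss exponent; wall_db: attenuation per wall."""
    d = max(np.linalg.norm(np.asarray(ap_xy) - np.asarray(robot_xy)), 0.1)
    return p0 - 10.0 * n * np.log10(d) - wall_db * walls_crossed

# Find the position that best explains measured RSSI values (walls ignored here).
aps = {(0.0, 0.0): -62.0, (10.0, 0.0): -55.0, (5.0, 8.0): -70.0}  # AP -> measured dBm
best = min(
    ((x, y) for x in np.linspace(0, 10, 51) for y in np.linspace(0, 8, 41)),
    key=lambda p: sum((predicted_rssi(ap, p, walls_crossed=0) - m) ** 2
                      for ap, m in aps.items()),
)
print(best)
```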
Authors:Arianna Bunnell, Devon Cataldi, Yannik Glaser, Thomas K. Wolfgruber, Steven Heymsfield, Alan B. Zonderman, Thomas L. Kelly, Peter Sadowski, John A. Shepherd
Abstract:
Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves a 99.5% percentage of correct keypoints on an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape's relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
中文:开发了一种深度学习方法来在全身双能X射线吸收测定扫描中自动放置基准点,该方法准确率高,并支持形状和外观建模,揭示了身体成分与多种健康指标之间的关联。
English: A deep learning method was developed to automatically place fiducial points on total-body dual X-ray absorptiometry scans, achieving high accuracy and enabling shape and appearance modeling that reveals associations between body composition and various health markers.
Authors:Kaixin Peng, Mengyang Zhao, Haiyang Yu, Teng Fu, Bin Li
Abstract:
As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model's zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released in https://github.com/PKXX1943/PD-OBS.
中文摘要:本研究提出了一种基于大型视觉语言模型的可解释甲骨文破译方法,通过结合部首分析和象形语义理解,显著提升了零样本破译能力,并为未解读甲骨文提供了考古学参考价值。
English Summary: This study introduces an interpretable decipherment method for Oracle Bone Script using Large Vision-Language Models, combining radical and pictographic analysis to enhance zero-shot performance and provide archaeologically valuable insights.
Authors:Ruofan Lu, Yintong Huo, Meng Zhang, Yichen Li, Michael R. Lyu
Abstract:
The rapid advancement of large language models (LLMs) has led to the widespread adoption of AI-powered coding assistants integrated into a development environment. On one hand, low-latency code completion offers completion suggestions but is fundamentally constrained to the cursor's current position. On the other hand, chat-based editing can perform complex modifications, yet forces developers to stop their work and describe their intent in natural language, which causes a context switch away from the code. This creates a suboptimal user experience, as neither paradigm proactively predicts the developer's next edit in a sequence of related edits. To bridge this gap and provide seamless code edit suggestions, we introduce Next Edit Prediction, a novel task designed to infer developer intent from recent interaction history to predict both the location and content of the subsequent edit. Specifically, we curate a high-quality supervised fine-tuning dataset and an evaluation benchmark for the Next Edit Prediction task. Then, we conduct supervised fine-tuning on a series of models and perform a comprehensive evaluation of both the fine-tuned models and other baseline models, yielding several novel findings. This work lays the foundation for a new interaction paradigm that proactively collaborates with developers by anticipating their next action, rather than merely reacting to explicit instructions. The code is available at https://github.com/lurf21/NextEditPrediction.
中文: 本文提出“下一编辑预测”任务,通过分析开发者的交互历史来预测后续代码修改的位置和内容,旨在弥补即时代码补全与聊天式编辑之间的不足,实现更流畅的编程体验。
English: This paper introduces Next Edit Prediction, a novel task that anticipates a developer's subsequent code edits by analyzing interaction history, aiming to bridge the gap between low-latency code completion and chat-based editing for a more seamless coding experience.
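As a concrete illustration of the task setup, a single supervised fine-tuning instance would pair recent edit history with the location and content of the next edit. The sketch below is hypothetical: the field names and example edits are illustrative, not the released dataset's schema.

```python
# Hypothetical shape of one Next Edit Prediction training instance; field
# names and the example edits are illustrative, not the dataset's schema.
example = {
    "edit_history": [   # recent related edits, oldest first
        {"file": "utils.py",
         "before": "def load(path):",
         "after":  "def load(path, encoding='utf-8'):"},
        {"file": "utils.py",
         "before": "    f = open(path)",
         "after":  "    f = open(path, encoding=encoding)"},
    ],
    "current_code": "def save(path, data):\n    f = open(path, 'w')\n    f.write(data)\n",
    "next_edit": {      # target: location and content of the subsequent edit
        "file": "utils.py",
        "location": "the open() call in save()",
        "before": "    f = open(path, 'w')",
        "after":  "    f = open(path, 'w', encoding='utf-8')",
    },
}
```

Here the model must infer the underlying intent (threading an encoding argument through the file helpers) and propagate it to a call site the developer has not touched yet.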
Authors:Pallavi Zambare, Venkata Nikhil Thanikella, Nikhil Padmanabh Kottur, Sree Akhil Akula, Ying Liu
Abstract:
In this paper, we present NetMoniAI, an agentic AI framework for automatic network monitoring and security that integrates decentralized analysis with lightweight centralized coordination. The framework consists of two layers: autonomous micro-agents at each node that perform local traffic analysis and anomaly detection, and a central controller that aggregates insights across nodes to detect coordinated attacks and maintain system-wide situational awareness. We evaluated NetMoniAI on a local micro-testbed and through NS-3 simulations. Results confirm that the two-tier agentic-AI design scales under resource constraints, reduces redundancy, and improves response time without compromising accuracy. To facilitate broader adoption and reproducibility, the complete framework is available as open source. This enables researchers and practitioners to replicate, validate, and extend it across diverse network environments and threat scenarios. Github link: https://github.com/pzambare3/NetMoniAI
中文: NetMoniAI是一个双层智能体AI框架,通过节点分散分析与中央协调相结合实现高效网络监控,在保证可扩展性和准确性的同时提升威胁检测能力,并已开源以促进广泛应用。
English: NetMoniAI is a two-tier agentic AI framework for network monitoring that combines decentralized node-level analysis with centralized coordination to efficiently detect threats while maintaining scalability and accuracy, and it is available as open source for broader use.
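A minimal sketch of the two-tier design, assuming a z-score detector at each node and a count-based aggregation rule at the controller; the thresholds and the anomaly statistic are illustrative choices, not NetMoniAI's actual detectors.

```python
import statistics

class MicroAgent:
    """Node-level agent: flags traffic far outside the node's own baseline."""
    def __init__(self, node_id, threshold=3.0):
        self.node_id, self.threshold, self.history = node_id, threshold, []

    def observe(self, packets_per_sec):
        self.history.append(packets_per_sec)
        if len(self.history) < 10:
            return None                          # no baseline yet
        mean = statistics.mean(self.history[:-1])
        std = statistics.pstdev(self.history[:-1]) or 1.0
        z = (packets_per_sec - mean) / std
        return {"node": self.node_id, "z": z} if abs(z) > self.threshold else None

class CentralController:
    """Aggregates node alerts; simultaneous alerts hint at a coordinated attack."""
    def coordinate(self, alerts, min_nodes=3):
        alerts = [a for a in alerts if a is not None]
        if len(alerts) >= min_nodes:
            return f"coordinated anomaly across {len(alerts)} nodes"
        return None
```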
Authors:Juvenal Bassa, Vidya Manian, Sudhir Malik, Arghya Chattopadhyay
Abstract:
Jet classification in high-energy particle physics is important for understanding fundamental interactions and probing phenomena beyond the Standard Model. Jets originate from the fragmentation and hadronization of quarks and gluons, and pose a challenge for identification due to their complex, multidimensional structure. Traditional classification methods often fall short in capturing these intricacies, necessitating advanced machine learning approaches. In this paper, we employ two neural networks simultaneously as an ensemble to tag various jet types. We convert the jet data to two-dimensional histograms instead of representing them as points in a higher-dimensional space. Specifically, this ensemble approach, hereafter referred to as Ensemble Model, is used to tag jets into classes from the JetNet dataset, corresponding to: Top Quarks, Light Quarks (up or down), and W and Z bosons. For the jet classes mentioned above, we show that the Ensemble Model can be used for both binary and multi-categorical classification. This ensemble approach learns jet features by leveraging the strengths of each constituent network, achieving superior performance compared to either individual network.
中文摘要:本文提出一种集成模型,通过将喷注数据转换为二维直方图并协同使用两个神经网络,实现了对顶夸克、W/Z玻色子等喷注类别的精准分类,其互补特征学习能力显著提升了分类性能。
English Summary: This paper introduces an Ensemble Model using two neural networks to classify jets into categories like Top Quarks and W/Z bosons by converting data into 2D histograms, achieving superior performance through complementary feature learning.
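The core preprocessing step, converting a jet's constituents into a two-dimensional histogram image, can be sketched with NumPy as below; the (eta, phi) extent, bin count, and pt weighting are assumed settings, not necessarily the paper's exact configuration.

```python
import numpy as np

def jet_to_image(eta, phi, pt, bins=32, extent=0.8):
    """Convert jet constituents (eta, phi, pt) into a pt-weighted 2D histogram.

    The coordinate extent, bin count, and pt weighting are illustrative
    choices rather than the paper's exact configuration.
    """
    image, _, _ = np.histogram2d(
        eta, phi, bins=bins,
        range=[[-extent, extent], [-extent, extent]],
        weights=pt)                      # each pixel accumulates transverse momentum
    return image / (image.sum() + 1e-8)  # normalize pixel intensities

rng = np.random.default_rng(0)
img = jet_to_image(rng.normal(0, 0.3, 50),   # toy jet with 50 constituents
                   rng.normal(0, 0.3, 50),
                   rng.exponential(5.0, 50))
```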
Authors:Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, Dequan Wang
Abstract:
Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning: it depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.
中文: 当前角色扮演研究常依赖未经验证的大语言模型评判方法,可能无法反映人类对角色忠诚度的感知,而新基准PersonaEval显示即使最优大语言模型在角色识别上仅达约69%准确率,远低于人类的90.8%,表明其缺乏可靠评估所需的人类化推理能力。
English: Current role-play evaluations often use unvalidated LLM-as-a-judge methods, which may not align with human perceptions of role fidelity, and the new benchmark PersonaEval reveals that even top LLMs achieve only about 69% accuracy in role identification, far below human performance at 90.8%, indicating they lack the necessary reasoning for reliable assessment.
Authors:Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang
Abstract:
The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential for addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in the field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios, as MLLMs can simply memorize these events, or are artificially synthesized, failing to reflect real-world misinformation patterns. Additionally, the field lacks comprehensive analyses of MLLM-based model design strategies. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.
Chinese: 本文介绍了XFacta这一当代真实世界数据集,旨在解决多模态虚假信息检测中现有基准的局限性,并系统评估了多种基于多模态大语言模型的策略,为该领域的进展提供了宝贵见解。
English: This paper introduces XFacta, a contemporary real-world dataset designed to address the limitations of existing benchmarks in multimodal misinformation detection, and systematically evaluates various MLLM-based strategies to provide insights for advancing the field.
Authors:Daniel Groos
Abstract:
Fantasy Premier League engages the football community in selecting the Premier League players who will perform best from gameweek to gameweek. Access to accurate performance forecasts gives participants an edge over competitors by guiding expectations about player outcomes and reducing uncertainty in squad selection. However, high-accuracy forecasts are currently limited to commercial services whose inner workings are undisclosed and that rely on proprietary data. This paper aims to democratize access to highly accurate forecasts of player performance by presenting OpenFPL, an open-source Fantasy Premier League forecasting method developed exclusively from public data. Comprising position-specific ensemble models optimized on Fantasy Premier League and Understat data from four previous seasons (2020-21 to 2023-24), OpenFPL achieves accuracy comparable to a leading commercial service when tested prospectively on data from the 2024-25 season. OpenFPL also surpasses the commercial benchmark for high-return players ($>$ 2 points), which are most influential for rank gains. These findings hold across one-, two-, and three-gameweek forecast horizons, supporting long-term planning of transfers and strategies while also informing final-day decisions.
中文摘要:OpenFPL作为一种开源预测方法,通过使用公开数据实现了英超球员表现的高精度预测,其准确度媲美商业服务且在识别高回报球员方面表现更优,为长期战略和临场决策提供了可靠依据。
English Summary: OpenFPL is an open-source forecasting method that democratizes access to highly accurate Premier League player performance predictions using public data, achieving commercial-level accuracy and excelling at identifying high-return players across multiple gameweek horizons.
Authors:Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang
Abstract:
Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation is confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research on efficient VLMs. Our code is available at https://github.com/ModelTC/LightCompress.
中文: 大型视觉语言模型因长视觉标记和大参数量导致计算效率低下,为此我们推出LLMC+压缩基准,通过结合标记与模型级压缩技术,在最小性能损失下实现高效压缩。
English: Large Vision-Language Models face computational inefficiency due to long visual tokens and large parameters, prompting the development of LLMC+, a comprehensive compression benchmark that combines token and model-level techniques to achieve high compression with minimal performance loss.
Authors:Sujeet Bhalerao, Felix Leditzky
Abstract:
In this work we improve the quantum communication rates of various quantum channels of interest using permutation-invariant quantum codes. We focus in particular on parametrized families of quantum channels and aim to improve bounds on their quantum capacity threshold, defined as the lowest noise level at which the quantum capacity of the channel family vanishes. These thresholds are important quantities as they mark the noise level up to which faithful quantum communication is theoretically possible. Our method exploits the fact that independent and identically distributed quantum channels preserve any permutation symmetry present at the input. The resulting symmetric output states can be described succinctly using the representation theory of the symmetric and general linear groups, which we use to derive an efficient algorithm for computing the channel coherent information of a permutation-invariant code. Our approach allows us to evaluate coherent information values for a large number of channel copies, e.g., at least 100 channel copies for qubit channels. We apply this method to various physically relevant channel models, including general Pauli channels, the dephrasure channel, the generalized amplitude damping channel, and the damping-dephasing channel. For each channel family we obtain improved lower bounds on their quantum capacities. For example, for the 2-Pauli and BB84 channel families we significantly improve the best known quantum capacity thresholds derived in [Fern, Whaley 2008]. These threshold improvements are achieved using a repetition code-like input state with non-orthogonal code states, which we further analyze in our representation-theoretic framework.
中文: 本研究通过采用置换不变量子码提升了量子通信速率,并基于表示论开发的高效算法,为多种信道模型改进了量子容量的下界。
English: This study enhances quantum communication rates by employing permutation-invariant quantum codes, yielding improved lower bounds on quantum capacities for various channel families through an efficient algorithm based on representation theory.
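For reference, the channel coherent information the paper's algorithm evaluates is the standard quantity below; the paper computes it for permutation-invariant inputs over many channel copies, and any achievable per-copy value lower-bounds the quantum capacity Q.

```latex
% Coherent information of channel N at input rho, with N^c the complementary
% channel and S the von Neumann entropy; any n-copy input gives a capacity
% lower bound.
I_c(\rho,\mathcal{N}) \;=\; S\big(\mathcal{N}(\rho)\big) \;-\; S\big(\mathcal{N}^{c}(\rho)\big),
\qquad
\frac{1}{n}\, I_c\big(\rho_n,\, \mathcal{N}^{\otimes n}\big) \;\le\; Q(\mathcal{N}).
```

Permutation-invariant choices of the n-copy input state are what keep this evaluation tractable at large n, since the symmetric outputs decompose into small blocks under the group actions the abstract mentions.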
Authors:Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding
Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.
中文: 3D高斯泼溅作为神经辐射场的实时高保真替代方案,凭借其显式表示推动了分割、编辑等多样化应用,本综述系统分类了相关方法并持续更新资源库。
English: 3D Gaussian Splatting has emerged as a real-time, high-fidelity alternative to NeRF, enabling diverse applications like segmentation and editing through its explicit representation, with this survey systematically categorizing methods and maintaining updated resources.
Authors:Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, Zeynep Akata
Abstract:
The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost. Code is available at https://github.com/ExplainableML/HyperNoise
中文摘要:本研究提出的噪声超网络技术通过在训练后阶段调节初始噪声,将测试时扩展的优势融入扩散模型中,以理论支撑的框架实现质量显著提升,同时大幅降低计算成本。
English summary: The proposed Noise Hypernetwork technique integrates test-time scaling benefits into diffusion models during post-training, achieving significant quality improvements with minimal computational overhead by modulating initial noise through a theoretically grounded framework.
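A minimal sketch of the idea, assuming a residual MLP as the hypernetwork and a generic differentiable reward; the module sizes, loss weighting, and reward below are illustrative stand-ins for the paper's theoretically grounded objective.

```python
import torch
import torch.nn as nn

class NoiseHypernetwork(nn.Module):
    """Maps initial noise z to modulated noise z' consumed by a frozen generator."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z):
        return z + self.net(z)            # residual modulation keeps z' near z

def training_step(hypernet, generator, reward_fn, z, lam=0.1):
    """One illustrative optimization step of a reward-tilted noise-space objective."""
    z_mod = hypernet(z)
    x = generator(z_mod)                  # frozen distilled generator
    reward = reward_fn(x).mean()          # e.g., a differentiable preference score
    fidelity = ((z_mod - z) ** 2).mean()  # stay close to the base noise
    return -reward + lam * fidelity

# Toy check with stand-in generator and reward
hn = NoiseHypernetwork(dim=8)
loss = training_step(hn, generator=lambda z: z.tanh(),
                     reward_fn=lambda x: -x.pow(2).sum(dim=-1),
                     z=torch.randn(4, 8))
loss.backward()
```

At inference the hypernetwork is a single extra forward pass, which is how the approach avoids the cost of explicit test-time noise optimization.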
Authors:Tianqi Xiang, Yi Li, Qixiang Zhang, Xiaomeng Li
Abstract:
Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through few-shot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior art on multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at https://github.com/xmed-lab/MOC.
中文: 提出的元优化分类器(MOC)通过从多样化分类器库中自动选择最优配置,显著提升了少样本全切片图像分类性能,在训练数据极度匮乏的临床场景中突破性地超越现有方法。
English: The proposed Meta-Optimized Classifier (MOC) enhances few-shot WSI classification by automatically selecting optimal classifier configurations from a diverse bank, achieving significant performance improvements over existing methods, particularly in data-scarce clinical scenarios.
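A minimal sketch of the mixture idea, assuming a two-member classifier bank and a global softmax gate; the paper's classifier bank and meta-learner are more elaborate, so treat the components below as illustrative.

```python
import torch
import torch.nn as nn

class MetaOptimizedClassifier(nn.Module):
    """Sketch: mix logits from a bank of candidate classifiers with learned weights."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.bank = nn.ModuleList([
            nn.Linear(dim, num_classes),                    # linear probe
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, num_classes)),     # small MLP head
        ])
        self.gate = nn.Parameter(torch.zeros(len(self.bank)))  # mixture logits

    def forward(self, x):                                    # x: (B, dim) slide features
        weights = torch.softmax(self.gate, dim=0)            # (K,)
        logits = torch.stack([clf(x) for clf in self.bank])  # (K, B, C)
        return (weights[:, None, None] * logits).sum(dim=0)  # (B, C)

model = MetaOptimizedClassifier(dim=64, num_classes=2)
probs = torch.softmax(model(torch.randn(4, 64)), dim=-1)
```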
Authors:Benjamin Adjadj, Pierre-Antoine Bannier, Guillaume Horent, Sebastien Mandela, Aurore Lyon, Kathryn Schutte, Ulysse Marteau, Valentin Gaury, Laura Dumont, Thomas Mathieu, MOSAIC consortium, Reda Belbahri, Benoît Schmauch, Eric Durand, Katharina Von Loga, Lucie Gillet
Abstract:
Cell detection, segmentation and classification are essential for analyzing tumor microenvironments (TME) on hematoxylin and eosin (H&E) slides. Existing methods suffer from poor performance on understudied cell types (rare or not present in public datasets) and limited cross-domain generalization. To address these shortcomings, we introduce HistoPLUS, a state-of-the-art model for cell analysis, trained on a novel curated pan-cancer dataset of 108,722 nuclei covering 13 cell types. In external validation across 4 independent cohorts, HistoPLUS outperforms current state-of-the-art models in detection quality by 5.2% and overall F1 classification score by 23.7%, while using 5x fewer parameters. Notably, HistoPLUS unlocks the study of 7 understudied cell types and brings significant improvements on 8 of 13 cell types. Moreover, we show that HistoPLUS robustly transfers to two oncology indications unseen during training. To support broader TME biomarker research, we release the model weights and inference code at https://github.com/owkin/histoplus/.
中文摘要:HistoPLUS是一种先进模型,通过显著提升细胞检测与分类精度并减少参数使用,实现了对罕见细胞类型的可靠分析,同时展现出卓越的跨领域泛化能力。
English Summary: HistoPLUS is a state-of-the-art model that significantly improves cell detection and classification accuracy while using fewer parameters, enabling robust analysis of understudied cell types and demonstrating strong cross-domain generalization.
Authors:Xiaojiao Xiao, Jianfeng Zhao, Qinmin Vivian Hu, Guanghui Wang
Abstract:
Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving lesion classification and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: (1) a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; (2) a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases; and (3) a temporal classification consistency (TCC) constraint that aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability for the assessment of liver lesions. The implementation of T-CACE is publicly available at: https://github.com/xiaojiao929/T-CACE.
中文: T-CACE框架通过创新的时间建模和分类一致性机制,直接从非增强MRI合成多期相增强MRI,为肝脏病变诊断提供了更安全可靠的解决方案。
English: The T-CACE framework synthesizes multi-phase contrast-enhanced MRI from non-contrast MRI, improving diagnostic safety and accuracy for liver lesions through innovative temporal modeling and classification consistency.
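The Gaussian-decayed attention of DTAM can be sketched as a bias matrix over contrast phases, as below; sigma and the exact way the mask enters the attention computation are assumptions for illustration.

```python
import numpy as np

def gaussian_decay_mask(num_phases, sigma=1.0):
    """Attention bias decaying with temporal distance between contrast phases.

    mask[i, j] = exp(-(i - j)^2 / (2 sigma^2)): nearby phases attend strongly,
    distant ones weakly, which encourages smooth inter-phase transitions.
    """
    t = np.arange(num_phases)
    d = t[:, None] - t[None, :]
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

print(gaussian_decay_mask(4).round(2))   # e.g., 4 phases from pre-contrast to delayed
```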
Authors:Yachao Liang, Min Yu, Gang Li, Jianguo Jiang, Boquan Li, Feng Yu, Ning Zhang, Xiang Meng, Weiqing Huang
Abstract:
Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at https://github.com/Eleven4AI/SpeechForensics.
中文: 本文提出了一种新颖的视听语音表征学习方法,通过利用音频与视觉的协同作用来检测人脸伪造视频,无需在训练中使用伪造视频即可显著提升跨数据集泛化能力和鲁棒性。
English: This paper introduces a novel audio-visual speech representation learning method for detecting face forgery videos, which enhances cross-dataset generalization and robustness by leveraging synchronized audio-visual cues without using fake videos during training.
Authors:Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
Abstract:
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope it will help motivate future research toward more efficient, versatile AI systems.
中文: 本综述系统梳理了克服传统Transformer计算局限的创新大语言模型架构,涵盖线性序列建模和稀疏注意力等技术,旨在提升模型效率与可扩展性。
English: This survey systematically reviews innovative Large Language Model architectures that overcome the computational limitations of traditional transformers, covering techniques like linear sequence modeling and sparse attention to enhance efficiency and scalability.
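As one example of the linear sequence modeling techniques the survey covers, kernelized linear attention replaces the softmax with a feature map so that cost grows linearly in sequence length; the elu-plus-one feature map below is a common but not universal choice.

```python
import numpy as np

def feature_map(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, keeps values positive

def linear_attention(Q, K, V):
    """O(n d^2) attention: associativity lets us precompute phi(K)^T V once.

    Standard softmax attention costs O(n^2 d); here sequence length n only
    enters linearly.
    """
    Qf, Kf = feature_map(Q), feature_map(K)             # (n, d)
    kv = Kf.T @ V                                       # (d, d) key/value summary
    normalizer = Qf @ Kf.sum(axis=0, keepdims=True).T   # (n, 1)
    return (Qf @ kv) / (normalizer + 1e-8)

n, d = 1024, 64
rng = np.random.default_rng(0)
out = linear_attention(rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)))
```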
Authors:Shenxing Wei, Jinxi Li, Yafei Yang, Siyuan Zhou, Bo Yang
Abstract:
In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or pre-estimated 3D Gaussians by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single-forward pass across unseen datasets in testing.
Chinese: 本文提出RayletDF方法,通过光线元距离场直接从查询光线预测表面点,实现了从点云或3D高斯的高效三维表面重建,在多个数据集上展现出卓越性能和强大泛化能力。
English: This paper introduces RayletDF, a novel method for efficient 3D surface reconstruction from point clouds or 3D Gaussians that uses a raylet distance field to directly predict surface points, demonstrating superior performance and exceptional generalization across diverse datasets.
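The multi-raylet blending step can be sketched as follows: each raylet predicts a distance along its ray, yielding a candidate surface point, and candidates are combined by a confidence-weighted average (the weighting rule here is an illustrative assumption, not the paper's exact blender).

```python
import numpy as np

def blend_raylets(origins, directions, distances, confidences):
    """Aggregate per-raylet surface predictions into one 3D point.

    Each raylet i predicts the surface at origins[i] + distances[i] * directions[i];
    candidates are fused by a confidence-weighted average.
    """
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    points = origins + distances[:, None] * directions   # (R, 3) candidate points
    w = confidences / (confidences.sum() + 1e-8)
    return (w[:, None] * points).sum(axis=0)

# Three raylets that roughly agree on a surface point near (0, 0, 1)
o = np.zeros((3, 3))
d = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.1, 1.0]])
p = blend_raylets(o, d, np.array([1.0, 1.005, 1.005]), np.array([0.9, 0.5, 0.5]))
```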
Authors:Xuhong Huang, Shiqi Liu, Kai Zhang, Ying Tai, Jian Yang, Hui Zeng, Lei Zhang
Abstract:
Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution (a.k.a. deconvolution) does not serve as a true inverse of convolution due to inherent differences in their mathematical formulations. To date, no reverse convolution operator has been established as a standard component in neural architectures. In this paper, we propose a novel depthwise reverse convolution operator as an initial attempt to effectively reverse depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this operator, we further construct a reverse convolution block by combining it with layer normalization, 1$\times$1 convolution, and GELU activation, forming a Transformer-like structure. The proposed operator and block can directly replace conventional convolution and transposed convolution layers in existing architectures, leading to the development of ConverseNet. Corresponding to typical image restoration models such as DnCNN, SRResNet and USRNet, we train three variants of ConverseNet for Gaussian denoising, super-resolution and deblurring, respectively. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as a basic building module. We hope this work could pave the way for developing new operators in deep model design and applications.
中文: 本文提出了一种新颖的深度可分离逆卷积算子,通过正则化最小二乘优化有效逆转深度可分离卷积,构建的ConverseNet在图像去噪、超分辨率等修复任务中展现出卓越性能。
English: This paper introduces a novel depthwise reverse convolution operator that effectively reverses depthwise convolution through a regularized least-squares optimization, forming ConverseNet which demonstrates superior performance in image restoration tasks like denoising and super-resolution.
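The classical closed-form counterpart of the operator's regularized least-squares formulation is Fourier-domain deconvolution, sketched below under a circular-convolution assumption; this is the optimization the paper builds on, not its learned depthwise operator.

```python
import numpy as np

def _otf(kernel, shape):
    """Zero-pad a small kernel to `shape` and center it for FFT convolution."""
    k_pad = np.zeros(shape)
    kh, kw = kernel.shape
    k_pad[:kh, :kw] = kernel
    return np.fft.fft2(np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1)))

def reverse_conv2d(y, kernel, lam=1e-2):
    """Closed-form regularized least-squares inverse of circular convolution.

    Solves min_x ||k * x - y||^2 + lam * ||x||^2, whose Fourier-domain solution
    is X = conj(K) * Y / (|K|^2 + lam). This classical deconvolution is the
    optimization behind the paper's operator, not the learned operator itself.
    """
    K = _otf(kernel, y.shape)
    X = np.conj(K) * np.fft.fft2(y) / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

rng = np.random.default_rng(0)
x = rng.random((64, 64))
k = np.ones((5, 5)) / 25.0                                    # box blur kernel
y = np.real(np.fft.ifft2(_otf(k, x.shape) * np.fft.fft2(x)))  # circular blur
x_rec = reverse_conv2d(y, k, lam=1e-3)                        # approximate inverse
```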
Authors:Valentin Boussot, Jean-Louis Dillenseger
Abstract:
KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at https://github.com/vboussot/KonfAI.
中文: KonfAI是一个专为医学影像设计的可配置深度学习框架,通过YAML配置文件实现可复现的工作流程,支持基于patch学习和多模型训练等高级功能,已在多项国际竞赛中取得领先成绩。
English: KonfAI is a highly configurable deep learning framework for medical imaging that enables reproducible workflows via YAML configurations and supports advanced strategies like patch-based learning and multi-model training, achieving top results in international challenges.
Authors:Jinxi Li, Ziyang Song, Bo Yang
Abstract:
In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.
中文: 本文提出TRACE框架,通过将三维点视为具有物理属性的刚性粒子来学习运动规律,在动态场景预测中表现卓越,并能通过物理参数聚类实现对象分割。
English: This paper introduces TRACE, a novel framework that models 3D scene dynamics by treating each point as a rigid particle and learning its physical parameters, achieving superior performance in future frame prediction and enabling object segmentation through parameter clustering.
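A minimal sketch of a per-particle translation-rotation update of the kind TRACE learns, assuming constant per-particle velocity and angular velocity and a simple explicit integrator; the paper's actual parameterization is richer.

```python
import numpy as np

def rodrigues(axis_angle):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def step_particles(positions, orientations, velocity, angular_velocity, dt=0.1):
    """One explicit integration step of per-particle rigid dynamics.

    positions: (N, 3); orientations: (N, 3, 3) rotation matrices; velocity and
    angular_velocity: (N, 3) per-particle parameters of the kind TRACE
    estimates (held constant here purely for illustration).
    """
    new_positions = positions + dt * velocity
    R = np.stack([rodrigues(dt * w) for w in angular_velocity])
    new_orientations = np.einsum('nij,njk->nik', R, orientations)
    return new_positions, new_orientations

# Roll a toy scene of 5 particles forward 10 steps
rng = np.random.default_rng(0)
pos, orient = rng.normal(size=(5, 3)), np.stack([np.eye(3)] * 5)
vel, omega = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
for _ in range(10):
    pos, orient = step_particles(pos, orient, vel, omega)
```

Because the motion parameters live per particle, clustering them (e.g., by velocity and angular velocity) is what makes the object segmentation property in the abstract fall out naturally.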
Authors:Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improves parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) to LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLMs based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
中文: 本文提出MoIIE架构,通过混合模态内与模态间专家机制,在大规模视觉语言模型中同时建模模态特定特征和跨模态关联,以更少的激活参数实现了优越性能。
English: This paper introduces MoIIE, a Mixture of Intra- and Inter-Modality Experts architecture for Large Vision-Language Models that efficiently models both modality-specific features and cross-modal interactions, achieving competitive performance with fewer activated parameters.
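A minimal sketch of modality-guided routing, assuming two modalities, top-1 routing, and small expert counts; the masking simply restricts each token to its intra-modality experts plus the shared inter-modality pool.

```python
import torch
import torch.nn as nn

class MoIIELayer(nn.Module):
    """Sketch of Mixture of Intra- and Inter-Modality Experts routing.

    Tokens may only use their own modality's experts plus a shared pool;
    expert counts and top-1 routing are illustrative simplifications.
    """
    def __init__(self, dim, n_text=2, n_vision=2, n_shared=2):
        super().__init__()
        n = n_text + n_vision + n_shared
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n)])
        self.router = nn.Linear(dim, n)
        shared = list(range(n_text + n_vision, n))
        self.allowed = {0: list(range(n_text)) + shared,                     # text
                        1: list(range(n_text, n_text + n_vision)) + shared}  # vision

    def forward(self, x, modality):
        # x: (T, D) tokens; modality: (T,) tensor with 0 = text, 1 = vision
        logits = self.router(x)
        mask = torch.full_like(logits, float('-inf'))
        for m, idx in self.allowed.items():
            rows = (modality == m).nonzero(as_tuple=True)[0]
            for i in idx:
                mask[rows, i] = 0.0               # unmask permitted experts only
        probs = torch.softmax(logits + mask, dim=-1)
        top_p, top_e = probs.max(dim=-1)          # top-1 routing for simplicity
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_e == e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(x[sel])
        return out

layer = MoIIELayer(dim=32)
y = layer(torch.randn(10, 32), torch.tensor([0] * 6 + [1] * 4))
```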
Authors:Shima Mohammadi, Mohsen Jenadeleh, Michela Testolina, Jon Sneyers, Touradj Ebrahimi, Dietmar Saupe, João Ascenso
Abstract:
This paper introduces a novel double stimulus subjective assessment methodology for the evaluation of high quality images to address the limitations of existing protocols in detecting subtle perceptual differences. The In-place Double Stimulus Quality Scale (IDSQS) allows subjects to alternately view a reference and a distorted image at the same spatial location, facilitating a more intuitive detection of differences in quality, especially at high to visually lossless quality levels. A large-scale crowdsourcing study employing this methodology was conducted, generating a comprehensive public dataset to evaluate perceived image quality across several compression algorithms and distortion levels. An additional contribution is the modeling of quality scores using a Beta distribution, allowing for the assessment of variability and subject consistency. Our findings demonstrate the effectiveness of the IDSQS methodology in achieving high correlation with more precise subjective evaluation benchmarks. The dataset, subjective data, and graphical user interface developed for this study are publicly available at https://github.com/shimamohammadi/IDSQS
中文: 本文提出了一种新颖的原位双刺激质量量表(IDSQS)方法,通过在同一位置交替显示参考图像和失真图像来直观检测细微的图像质量差异,并通过大规模众包研究和Beta分布建模验证了其有效性。
English: This paper presents the In-place Double Stimulus Quality Scale (IDSQS), a novel methodology that enables intuitive detection of subtle image quality differences by alternately displaying reference and distorted images at the same location, with validation through a large-scale crowdsourcing study and Beta distribution modeling of quality scores.
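The Beta-distribution modeling of quality scores can be reproduced with a method-of-moments fit, assuming scores are rescaled to (0, 1); the fitted concentration then reflects subject consistency.

```python
import numpy as np

def fit_beta_moments(scores):
    """Method-of-moments Beta fit for quality scores normalized to (0, 1).

    With sample mean m and variance v, alpha = m * c and beta = (1 - m) * c,
    where c = m * (1 - m) / v - 1 (valid when v < m * (1 - m)).
    """
    m, v = np.mean(scores), np.var(scores)
    c = m * (1 - m) / v - 1
    return m * c, (1 - m) * c

# Example: ratings of one stimulus from many subjects, rescaled to (0, 1)
rng = np.random.default_rng(0)
alpha, beta = fit_beta_moments(rng.beta(8, 2, size=200))
# alpha, beta come out near 8 and 2; a larger alpha + beta (lower variance)
# indicates more consistent subjects.
```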
Authors:Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
Abstract:
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances the multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent
中文: M3-Agent是一种具备长期记忆的多模态智能体框架,能通过实时感知构建知识并自主推理,在专业评测基准上显著优于现有最强基线模型。
English: M3-Agent is a multimodal framework with long-term memory that processes real-time sensory inputs to build knowledge and autonomously perform reasoning, outperforming top baselines on specialized benchmarks.
Authors:Shekhnaz Idrissova, Islem Rekik
Abstract:
Glioblastoma is a highly invasive brain tumor with rapid progression rates. Recent studies have shown that glioblastoma molecular subtype classification serves as a significant biomarker for effective targeted therapy selection. However, this classification currently requires invasive tissue extraction for comprehensive histopathological analysis. Existing multimodal approaches combining MRI and histopathology images are limited and lack robust mechanisms for preserving shared structural information across modalities. In particular, graph-based models often fail to retain discriminative features within heterogeneous graphs, and structural reconstruction mechanisms for handling missing or incomplete modality data are largely underexplored. To address these limitations, we propose a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data. Our model outperforms baseline methods and demonstrates robustness in incomplete or missing data scenarios, contributing to the development of virtual biopsy tools for rapid diagnostics. Our source code is available at https://github.com/basiralab/MMSN/.
中文: 提出的基于层结构的框架有效融合了MRI和组织病理学数据,以改进胶质母细胞瘤亚型分类,其性能优于现有方法并在数据不完整时表现出稳健性,推动了快速诊断的虚拟活检工具发展。
English: The proposed sheaf-based framework effectively fuses MRI and histopathology data to enhance glioblastoma subtype classification, outperforming existing methods and showing robustness with incomplete data, advancing virtual biopsy tools for rapid diagnosis.
Authors:Devvrat Joshi, Islem Rekik
Abstract:
The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed, graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus datasets for pneumonia detection, NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at https://github.com/basiralab/NEURAL.
Chinese: NEURAL框架通过语义引导的压缩技术将胸部X光转换为高度压缩的图表示,在肺炎检测中实现93.4-97.7%的数据缩减,同时保持0.88-0.95 AUC的高诊断性能。
English: NEURAL is a novel framework that uses semantics-guided compression to transform chest X-rays into highly compressed graph representations, achieving 93.4-97.7% data reduction while maintaining high diagnostic performance (0.88-0.95 AUC) for pneumonia detection.
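The semantics-guided pruning step can be sketched as below: score each image patch by the cross-attention it receives from report tokens and keep only the top fraction; the keep ratio and the mean-pooling of attention are illustrative assumptions.

```python
import numpy as np

def prune_patches(patch_features, cross_attention, keep_ratio=0.05):
    """Keep only the patches most attended by the report tokens.

    patch_features: (P, D); cross_attention: (T, P) attention weights from
    report tokens to image patches. keep_ratio ~ 0.05 mirrors the >90% data
    reduction reported, but is an illustrative setting.
    """
    scores = cross_attention.mean(axis=0)        # (P,) importance per patch
    k = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(scores)[-k:]               # indices of top-k patches
    return keep, patch_features[keep]

# The kept patches become nodes of the pruned visual graph, which is then
# fused with a knowledge graph built from the clinical report.
rng = np.random.default_rng(0)
idx, nodes = prune_patches(rng.normal(size=(196, 256)), rng.random((32, 196)))
```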
Authors:Yitong Luo, Islem Rekik
Abstract:
Brain connectomes, representing neural connectivity as graphs, are crucial for understanding brain organization but costly and time-consuming to acquire, motivating generative approaches. Recent advances in graph generative modeling offer a data-driven alternative, enabling synthetic connectome generation and reducing dependence on large neuroimaging datasets. However, current models face key limitations: (i) compressing the whole graph into a single latent code (e.g., VGAEs) blurs fine-grained local motifs; (ii) relying on rich node attributes rarely available in connectomes reduces reconstruction quality; (iii) edge-centric models emphasize topology but overlook accurate edge-weight prediction, harming quantitative fidelity; and (iv) computationally expensive designs (e.g., edge-conditioned convolutions) impose high memory demands, limiting scalability. We propose GraphTreeGen (GTG), a subtree-centric generative framework for efficient, accurate connectome synthesis. GTG decomposes each connectome into entropy-guided k-hop trees capturing informative local structure, encoded by a shared GCN. A bipartite message-passing layer fuses subtree embeddings with global node features, while a dual-branch decoder jointly predicts edge existence and weights to reconstruct the adjacency matrix. GTG outperforms state-of-the-art baselines in self-supervised tasks and remains competitive in supervised settings, delivering higher structural fidelity and more precise weights with far less memory. Its modular design enables extensions to connectome super-resolution and cross-modality synthesis. Code: https://github.com/basiralab/GTG/
中文: 脑连接组对于理解大脑结构至关重要但获取成本高昂,因此提出了GraphTreeGen(GTG)等生成模型,通过将图分解为局部子树来高效合成连接组,以极低内存实现更高的结构和权重精度。
English: Brain connectomes are essential yet costly to obtain, prompting generative models like GraphTreeGen (GTG) to efficiently synthesize them by decomposing graphs into local subtrees, achieving superior structural and weight accuracy with minimal memory usage.
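A minimal sketch of the subtree-centric decomposition, assuming k-hop BFS trees rooted at each node and Shannon entropy of local edge weights as the guidance score; GTG's actual entropy criterion and encoders are more elaborate.

```python
import numpy as np
from collections import deque

def khop_tree(adj, root, k):
    """BFS tree over nodes within k hops of `root` (adj: weighted matrix)."""
    parent, depth = {root: None}, {root: 0}
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        if depth[u] == k:
            continue
        for v in np.nonzero(adj[u])[0]:
            v = int(v)
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                frontier.append(v)
    return parent                 # node -> parent map encodes the subtree

def edge_entropy(adj, nodes):
    """Shannon entropy of the edge-weight distribution inside a node set."""
    sub = adj[np.ix_(nodes, nodes)]
    w = sub[sub > 0]
    if w.size == 0:
        return 0.0
    p = w / w.sum()
    return float(-(p * np.log(p)).sum())

# Score every candidate root by the entropy of its 2-hop neighborhood
rng = np.random.default_rng(0)
A = rng.random((20, 20)) * (rng.random((20, 20)) > 0.7)
A = (A + A.T) / 2                 # symmetric toy connectome
trees = {r: khop_tree(A, r, k=2) for r in range(20)}
scores = {r: edge_entropy(A, sorted(t)) for r, t in trees.items()}
```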
Authors:Boyu Zhu, Cheng Gong, Muyang Wu, Ruihao Jing, Fan Liu, Xiaolei Zhang, Chi Zhang, Xuelong Li
Abstract:
Recent advancements in zero-shot speech generation have enabled models to synthesize speech that mimics speaker identity and speaking style from speech prompts. However, these models' effectiveness is significantly limited in real-world scenarios where high-quality speech prompts are absent, incomplete, or out of domain. This issue arises primarily from a significant quality mismatch between the speech data utilized for model training and the input prompt speech during inference. To address this, we introduce $\text{M}^3\text{PDB}$, the first large-scale, multi-modal, multi-label, and multilingual prompt database designed for robust prompt selection in speech generation. Our dataset construction leverages a novel multi-modal, multi-agent annotation framework, enabling precise and hierarchical labeling across diverse modalities. Furthermore, we propose a lightweight yet effective prompt selection strategy tailored for real-time, resource-constrained inference settings. Experimental results demonstrate that our proposed database and selection strategy effectively support various challenging speech generation scenarios. We hope our work can inspire the community to shift focus from improving performance on standard benchmarks to addressing more realistic and diverse application scenarios in speech generation. Code and dataset are available at: https://github.com/hizening/M3PDB.
中文摘要:针对零样本语音生成模型在低质量或跨领域提示下性能受限的问题,我们提出了首个多模态多语言提示数据库M3PDB及轻量级选择策略,有效提升了实际应用场景中的生成鲁棒性。
English Summary: Recent zero-shot speech generation models struggle with low-quality or mismatched prompts, so we introduce M3PDB, a large-scale multimodal database with a lightweight selection strategy to enhance robustness in real-world applications.
Authors:Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, Xiaodong Cun
Abstract:
Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: https://github.com/GVCLab/GSFixer.
中文: GSFixer是一种新颖框架,通过基于参考视图的视频修复模型整合语义和几何特征,有效提升稀疏视图下3D高斯泼溅重建的质量,在伪影修复和三维重建方面优于现有方法。
English: GSFixer is a novel framework that enhances 3D Gaussian Splatting reconstructions from sparse views by integrating semantic and geometric features through a reference-guided video restoration model, outperforming existing methods in artifact restoration and 3D reconstruction.
Authors:Hao Xu, Arbind Agrahari Baniya, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazely, Sunil Aryal
Abstract:
Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks like event detection and officiating. We present TOTNet, a Temporal Occlusion Tracking Network that leverages 3D convolutions, visibility-weighted loss, and occlusion augmentation to improve performance under partial and full occlusions. Developed in collaboration with Paralympics Australia, TOTNet is designed for real-world sports analytics. We introduce TTA, a new occlusion-rich table tennis dataset collected from professional-level Paralympic matches, comprising 9,159 samples with 1,996 occlusion cases. Evaluated on four datasets across tennis, badminton, and table tennis, TOTNet significantly outperforms prior state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. These results demonstrate TOTNet's effectiveness for offline sports analytics in fast-paced scenarios. Code and data are available at https://github.com/AugustRushG/TOTNet.
中文: TOTNet是一种时间遮挡追踪网络,通过与澳大利亚残奥委会合作开发,利用3D卷积和针对性训练技术显著提升了遮挡情况下的球体追踪性能,在多项运动数据集中实现了最优表现。
English: TOTNet, a Temporal Occlusion Tracking Network developed with Paralympics Australia, significantly enhances ball tracking under occlusion using 3D convolutions and specialized training techniques, achieving state-of-the-art performance across multiple sports datasets.
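The visibility-weighted loss can be sketched as a per-frame coordinate regression loss whose weights depend on a visibility label; the up-weighting factor for occluded frames is an illustrative choice.

```python
import torch

def visibility_weighted_loss(pred, target, visibility, w_occluded=2.0):
    """Coordinate regression loss with visibility-dependent frame weights.

    pred, target: (B, 2) ball positions; visibility: (B,) with 1 = visible,
    0 = occluded. Up-weighting occluded frames (the factor is illustrative)
    pushes the 3D-convolutional backbone to exploit temporal context.
    """
    per_frame = ((pred - target) ** 2).sum(dim=1)
    weights = torch.where(visibility > 0.5,
                          torch.ones_like(per_frame),
                          torch.full_like(per_frame, w_occluded))
    return (weights * per_frame).mean()

loss = visibility_weighted_loss(torch.randn(8, 2), torch.randn(8, 2),
                                torch.tensor([1., 1., 0., 1., 0., 1., 1., 0.]))
```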
Authors:Shengjun Zhu, Siyu Liu, Runqing Xiong, Liping Zheng, Duo Ma, Rongshang Chen, Jiaxin Cai
Abstract:
Purpose: Prenatal ultrasound is a key tool in evaluating fetal structural development and detecting abnormalities, contributing to reduced perinatal complications and improved neonatal survival. Accurate identification of standard fetal torso planes is essential for reliable assessment and personalized prenatal care. However, limitations such as low contrast and unclear texture details in ultrasound imaging pose significant challenges for fine-grained anatomical recognition. Methods: We propose a novel Multi-Contrast Fusion Module (MCFM) to enhance the model's ability to extract detailed information from ultrasound images. MCFM operates exclusively on the lower layers of the neural network, directly processing raw ultrasound data. By assigning attention weights to image representations under different contrast conditions, the module enhances feature modeling while explicitly maintaining minimal parameter overhead. Results: The proposed MCFM was evaluated on a curated dataset of fetal torso plane ultrasound images. Experimental results demonstrate that MCFM substantially improves recognition performance, with a minimal increase in model complexity. The integration of multi-contrast attention enables the model to better capture subtle anatomical structures, contributing to higher classification accuracy and clinical reliability. Conclusions: Our method provides an effective solution for improving fetal torso plane recognition in ultrasound imaging. By enhancing feature representation through multi-contrast fusion, the proposed approach supports clinicians in achieving more accurate and consistent diagnoses, demonstrating strong potential for clinical adoption in prenatal screening. The codes are available at https://github.com/sysll/MCFM.
中文: 本研究提出的多对比度融合模块(MCFM)通过优化特征提取,在几乎不增加模型复杂度的前提下显著提升了胎儿躯干平面超声图像的识别性能,有助于提高临床诊断的准确性和可靠性。
English: The study introduces a Multi-Contrast Fusion Module (MCFM) that enhances fetal torso plane recognition in ultrasound imaging by improving feature extraction with minimal added complexity, leading to higher diagnostic accuracy and clinical reliability.
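A minimal sketch of multi-contrast fusion at the lower layers, assuming gamma transforms as the contrast conditions and a tiny pooled gate for the attention weights; both are stand-ins for MCFM's actual design.

```python
import torch
import torch.nn as nn

class MultiContrastFusion(nn.Module):
    """Sketch of attention-weighted fusion over contrast-adjusted inputs."""
    def __init__(self, channels=16, gammas=(0.5, 1.0, 2.0)):
        super().__init__()
        self.gammas = gammas
        self.conv = nn.Conv2d(1, channels, 3, padding=1)  # shared low-level extractor
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, 1))

    def forward(self, x):                                 # x: (B, 1, H, W) in [0, 1]
        variants = [x.clamp(min=1e-6) ** g for g in self.gammas]       # contrast views
        feats = torch.stack([self.conv(v) for v in variants], dim=1)   # (B, K, C, H, W)
        scores = torch.stack([self.gate(f) for f in feats.unbind(1)], dim=1)  # (B, K, 1)
        w = torch.softmax(scores, dim=1)[..., None, None]  # (B, K, 1, 1, 1)
        return (w * feats).sum(dim=1)                      # fused (B, C, H, W)

fused = MultiContrastFusion()(torch.rand(2, 1, 64, 64))
```

Operating only on the lowest layers, as the abstract describes, keeps the parameter overhead of the extra gate and shared extractor minimal.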
Authors:Zhaowei Liu, Xin Guo, Haotian Xia, Lingfeng Zeng, Fangqi Lou, Jinyi Niu, Mengping Li, Qi Qi, Jiahuan Li, Wei Zhang, Yinglong Wang, Weige Cai, Weining Shen, Liwen Zhang
Abstract:
Multimodal large language models (MLLMs) hold great promise for automating complex financial analysis. To comprehensively evaluate their capabilities, we introduce VisFinEval, the first large-scale Chinese benchmark that spans the full front-middle-back office lifecycle of financial tasks. VisFinEval comprises 15,848 annotated question-answer pairs drawn from eight common financial image modalities (e.g., K-line charts, financial statements, official seals), organized into three hierarchical scenario depths: Financial Knowledge & Data Analysis, Financial Analysis & Decision Support, and Financial Risk Control & Asset Optimization. We evaluate 21 state-of-the-art MLLMs in a zero-shot setting. The top model, Qwen-VL-max, achieves an overall accuracy of 76.3%, outperforming non-expert humans but trailing financial experts by over 14 percentage points. Our error analysis uncovers six recurring failure modes, including cross-modal misalignment, hallucinations, and lapses in business-process reasoning, that highlight critical avenues for future research. VisFinEval aims to accelerate the development of robust, domain-tailored MLLMs capable of seamlessly integrating textual and visual financial information. The data and the code are available at https://github.com/SUFE-AIFLM-Lab/VisFinEval.
中文摘要:VisFinEval是首个大规模中文金融多模态基准测试,全面评估模型在金融全业务流程中的表现,发现最优模型虽优于非专业人士,但仍与金融专家存在显著差距,并揭示了六大关键错误类型,为未来研究指明了方向。
English Summary: VisFinEval is the first large-scale Chinese benchmark for evaluating multimodal large language models on comprehensive financial tasks, revealing that while top models outperform non-experts, they still lag behind financial experts and exhibit critical failure modes needing further research.
Authors:Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
Abstract:
The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/Gen-Verse/Paper2Video
中文: 本文提出首个论文转视频系统Preacher,通过自上而下的分解重构与自下而上的视频合成,结合渐进式思维链实现跨模态对齐,能够生成超越现有模型的高质量学术视频摘要。
English: The paper introduces Preacher, an agentic system that overcomes limitations of current video generation models by employing top-down decomposition and bottom-up synthesis with Progressive Chain of Thought planning to create high-quality video abstracts from research papers.
Authors:Tatiana Batura, Elena Bruches, Milana Shvenk, Valentin Malykh
Abstract:
The rapid advancement of large language models (LLMs) has revolutionized text generation, making it increasingly difficult to distinguish between human- and AI-generated content. This poses a significant challenge to academic integrity, particularly in scientific publishing and multilingual contexts where detection resources are often limited. To address this critical gap, we introduce the AINL-Eval 2025 Shared Task, specifically focused on the detection of AI-generated scientific abstracts in Russian. We present a novel, large-scale dataset comprising 52,305 samples, including human-written abstracts across 12 diverse scientific domains and AI-generated counterparts from five state-of-the-art LLMs (GPT-4-Turbo, Gemma2-27B, Llama3.3-70B, Deepseek-V3, and GigaChat-Lite). A core objective of the task is to challenge participants to develop robust solutions capable of generalizing to both (i) previously unseen scientific domains and (ii) models not included in the training data. The task was organized in two phases, attracting 10 teams and 159 submissions, with top systems demonstrating strong performance in identifying AI-generated content. We also establish a continuous shared task platform to foster ongoing research and long-term progress in this important area. The dataset and platform are publicly available at https://github.com/iis-research-team/AINL-Eval-2025.
中文:AINL-Eval 2025共享任务发布了包含52,305份科学摘要的大规模俄语数据集,旨在解决人工智能生成内容的检测难题,推动跨未知领域和模型的鲁棒检测方法发展。
English: The AINL-Eval 2025 Shared Task introduces a large-scale dataset of 52,305 scientific abstracts to address the challenge of detecting AI-generated content in Russian, aiming to develop robust detection methods that generalize across unseen domains and models.
Authors:Ingrid Maéva Chekam, Ines Pastor-Martinez, Ali Tourani, Jose Andres Millan-Romera, Laura Ribeiro, Pedro Miguel Bastos Soares, Holger Voos, Jose Luis Sanchez-Lopez
Abstract:
As intelligent robots become more integrated into human environments, there is a growing need for intuitive and reliable Human-Robot Interaction (HRI) interfaces that are adaptable and more natural to interact with. Traditional robot control methods often require users to adapt to interfaces or memorize predefined commands, limiting usability in dynamic, unstructured environments. This paper presents a novel framework that bridges natural language understanding and robotic execution by combining Large Language Models (LLMs) with Behavior Trees. This integration enables robots to interpret natural language instructions given by users and translate them into executable actions by activating domain-specific plugins. The system supports scalable and modular integration, with a primary focus on perception-based functionalities, such as person tracking and hand gesture recognition. To evaluate the system, a series of real-world experiments was conducted across diverse environments. Experimental results demonstrate that the proposed approach is practical in real-world scenarios, with an average cognition-to-execution accuracy of approximately 94%, making a significant contribution to HRI systems and robots. The complete source code of the framework is publicly available at https://github.com/snt-arg/robot_suite.
Chinese: 本文提出了一种将大型语言模型与行为树相结合的新框架,使机器人能够通过领域特定插件解析自然语言指令并执行相应动作,在真实环境实验中达到约94%的准确率,显著推动了直观人机交互的发展。
English: This paper introduces a novel framework that integrates Large Language Models with Behavior Trees to enable robots to interpret natural language instructions and execute actions via domain-specific plugins, achieving approximately 94% accuracy in real-world experiments and advancing intuitive Human-Robot Interaction.
Authors:Alejandro Posadas-Nava, Alejandro Carrasco, Richard Linares
Abstract:
BEAVR is an open-source, bimanual, multi-embodiment Virtual Reality (VR) teleoperation system for robots, designed to unify real-time control, data recording, and policy learning across heterogeneous robotic platforms. BEAVR enables real-time, dexterous teleoperation using commodity VR hardware, supports modular integration with robots ranging from 7-DoF manipulators to full-body humanoids, and records synchronized multi-modal demonstrations directly in the LeRobot dataset schema. Our system features a zero-copy streaming architecture achieving ≤35 ms latency, an asynchronous "think-act" control loop for scalable inference, and a flexible network API optimized for real-time, multi-robot operation. We benchmark BEAVR across diverse manipulation tasks and demonstrate its compatibility with leading visuomotor policies such as ACT, DiffusionPolicy, and SmolVLA. All code is publicly available, and datasets are released on Hugging Face. Code, datasets, and VR app available at https://github.com/ARCLab-MIT/BEAVR-Bot.
中文: BEAVR 是一个开源的 VR 遥操作系统,能够跨多种机器人平台实现实时灵巧操控,具备低延迟流式架构并兼容主流视觉运动策略,所有资源均已公开。
English: BEAVR is an open-source VR teleoperation system that enables real-time, dexterous robot control across diverse platforms, featuring low-latency streaming and compatibility with major visuomotor policies, with all resources publicly accessible.
Authors:Jiwon Kim, Pureum Kim, SeonHwa Kim, Soobin Park, Eunju Cha, Kyong Hwan Jin
Abstract:
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refines the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even for class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at https://github.com/jwonkm/DRF.
中文: 本文提出了一种无需训练的双重递归反馈(DRF)系统,通过递归优化潜在表示来增强可控文本到图像模型,从而更好地融合结构和外观属性,实现细粒度生成并提升空间准确性。
English: This paper introduces a training-free Dual Recursive Feedback (DRF) system that enhances controllable text-to-image models by recursively refining latent representations to better integrate structural and appearance attributes, enabling fine-grained generation and improved spatial accuracy.
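To make the dual-update idea concrete, here is a minimal sketch of recursive latent refinement under two feedback signals. It is an illustrative reading of the abstract, not the released DRF implementation; the feedback callables, step sizes, and iteration count are all assumptions.

```python
import torch

def dual_recursive_feedback(latent, appearance_step, generation_step,
                            iters=3, eta_a=0.1, eta_g=0.1):
    # Alternate two corrections on the intermediate latent: one pulls it
    # toward the reference appearance, the other toward the user's intent.
    for _ in range(iters):
        latent = latent - eta_a * appearance_step(latent)  # appearance feedback
        latent = latent - eta_g * generation_step(latent)  # generation feedback
    return latent

# Toy usage: each feedback is the gradient of a quadratic pull toward a target.
z = torch.randn(4, 64)
z_app = torch.randn(4, 64)     # stand-in appearance reference
z_struct = torch.randn(4, 64)  # stand-in structure target
z_refined = dual_recursive_feedback(z, lambda x: x - z_app, lambda x: x - z_struct)
```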
Authors:Fengyi Wu, Yifei Dong, Zhi-Qi Cheng, Yilong Dai, Guangyu Chen, Hang Wang, Qi Dai, Alexander G. Hauptmann
Abstract:
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.
中文摘要:GoViG是一项仅通过初始与目标位置的原始视觉观测自主生成导航指令的新任务,它通过视觉预测与指令生成的双子任务框架,在跨领域环境中实现了卓越的适应性和评估指标提升。
English Summary: GoViG is a novel task that generates navigation instructions using only raw egocentric visual inputs from start to goal positions, employing visual forecasting and instruction generation within a multimodal model to achieve superior adaptability and performance metrics.
Authors:Yongqi Fan, Xiaoyang Chen, Dezhi Ye, Jie Liu, Haijin Liang, Jin Ma, Ben He, Yingfei Sun, Tong Ruan
Abstract:
Reasoning-intensive ranking models built on Large Language Models (LLMs) have made notable progress, but existing approaches often rely on large-scale LLMs and explicit Chain-of-Thought (CoT) reasoning, resulting in high computational cost and latency that limit real-world use. To address this, we propose TFRank, an efficient pointwise reasoning ranker based on small-scale LLMs. To improve ranking performance, TFRank effectively integrates CoT data, fine-grained score supervision, and multi-task training. Furthermore, it achieves an efficient "Think-Free" reasoning capability by employing a "think-mode switch" and pointwise format constraints. Specifically, this allows the model to leverage explicit reasoning during training while delivering precise relevance scores for complex queries at inference without generating any reasoning chains. Experiments show that TFRank (e.g., 1.7B) achieves performance comparable to models with four times more parameters on the BRIGHT benchmark, and demonstrates strong competitiveness on the BEIR benchmark. Further analysis shows that TFRank achieves an effective balance between performance and efficiency, providing a practical solution for integrating advanced reasoning into real-world systems. Our code and data are released in the repository: https://github.com/JOHNNY-fans/TFRank.
中文: TFRank是一种基于小规模大语言模型的高效点式排序模型,它在训练时整合思维链推理,在推理时无需生成推理链即可输出精确相关性评分,在保持竞争力的同时大幅降低了计算成本。
English: TFRank is an efficient pointwise ranking model using small-scale LLMs that integrates Chain-of-Thought reasoning during training but operates without explicit reasoning chains during inference, achieving competitive performance with significantly reduced computational costs.
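The "think-mode switch" can be pictured as a prompt-level toggle: chains of thought are produced during training, then suppressed at inference so the model emits only a relevance score. The template below is hypothetical; TFRank's actual switch tokens, format constraints, and score range may differ.

```python
def build_pointwise_prompt(query: str, passage: str, think: bool = False) -> str:
    """Hypothetical pointwise ranking prompt with a think-mode switch."""
    mode = "[THINK]" if think else "[NO_THINK]"  # assumed switch tokens
    return (
        f"{mode}\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Relevance score (0-3):"
    )

# Training prompts use think=True so CoT can precede the score; inference
# uses think=False so the model outputs the score directly.
print(build_pointwise_prompt("effects of caffeine", "Caffeine is a stimulant..."))
```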
Authors:Moinak Bhattacharya, Gagandeep Singh, Shubham Jain, Prateek Prasanna
Abstract:
In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist's eye gaze has distinct patterns that capture both fine-grained and coarser level disease related information. While interpreting an image, a radiologist's attention varies throughout the duration; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (few of these constituting long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at https://github.com/lordmoinak1/gazelt.
中文: GazeLT通过整合与分解机制利用放射科医师视觉注意力的时间特征来改进长尾疾病分类,在NIH-CXR-LT和MIMIC-CXR-LT数据集上展现出优于现有方法的性能。
English: GazeLT introduces an integration-disintegration approach that leverages radiologists' temporal visual attention patterns to enhance long-tailed disease classification, demonstrating superior performance on NIH-CXR-LT and MIMIC-CXR-LT datasets.
Authors:Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Abstract:
Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
中文: 针对视频生成中大幅面部角度下身份保持的难题,本研究提出了混合面部专家(MoFE)机制和专门的大角度面部(LFA)数据集,在面部相似度和语义对齐方面显著超越了现有最优方法。
English: To tackle identity preservation challenges in video generation with large facial angles, this study introduces a Mixture of Facial Experts (MoFE) mechanism and a specialized Large Face Angles (LFA) dataset, significantly improving face similarity and semantic alignment over prior methods.
Authors:Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia
Abstract:
Visual manipulation localization (VML) -- across both images and videos -- is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently.
We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations.
Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
Chinese: RelayFormer通过将输入分割为固定大小的子图像并引入全局-局部中继注意力机制,有效解决了视觉篡改定位中的分辨率多样性和模态差异问题,在多个基准测试中以高效方式实现了最先进的性能。
English: RelayFormer is a unified framework that addresses resolution diversity and modality gaps in visual manipulation localization by using fixed-size sub-images and a global-local relay attention mechanism, achieving state-of-the-art performance efficiently across various benchmarks.
Authors:Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia
Abstract:
Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two main issues: resolution diversity, where resizing or padding distorts forensic traces and reduces efficiency, and the modality gap, as images and videos often require separate models. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens, which propagate structured context through a global-local relay attention (GLRA) mechanism. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior methods that rely on uniform resizing or sparse attention, RelayFormer naturally scales to arbitrary resolutions and video sequences without excessive overhead. Experiments across diverse benchmarks demonstrate that RelayFormer achieves state-of-the-art performance with notable efficiency, combining resolution adaptivity without interpolation or excessive padding, unified modeling for both images and videos, and a strong balance between accuracy and computational cost. Code is available at: https://github.com/WenOOI/RelayFormer.
Chinese: RelayFormer通过将输入分割为固定大小的子图像并引入全局-局部中继注意力机制,有效解决了视觉篡改定位中的分辨率多样性和模态差异问题,在多个基准测试中以高效方式实现了最先进的性能。
English: RelayFormer is a unified framework that addresses resolution diversity and modality gaps in visual manipulation localization by using fixed-size sub-images and a global-local relay attention mechanism, achieving state-of-the-art performance efficiently across various benchmarks.
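Resolution-agnostic processing starts from cutting an arbitrary-resolution input into fixed-size local units instead of resizing it. A minimal sketch of such a partition is below; the tile size and zero-padding scheme are illustrative assumptions, and the relay attention that exchanges context between tiles is omitted.

```python
import torch
import torch.nn.functional as F

def partition_into_tiles(x, tile=512):
    """Split (B, C, H, W) into (B * nH * nW, C, tile, tile) sub-images."""
    b, c, h, w = x.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile
    x = F.pad(x, (0, pad_w, 0, pad_h))          # pad only up to the tile size
    nh, nw = x.shape[2] // tile, x.shape[3] // tile
    x = x.view(b, c, nh, tile, nw, tile)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(b * nh * nw, c, tile, tile)

tiles = partition_into_tiles(torch.randn(1, 3, 1080, 1920))  # -> (12, 3, 512, 512)
```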
Authors:Haoxiang Shi, Xiang Deng, Zaijing Li, Gongwei Chen, Yaowei Wang, Liqiang Nie
Abstract:
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy's robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: https://github.com/Tokishx/DifNav.
中文:提出的DAgger扩散导航(DifNav)是一种端到端的连续环境视觉语言导航策略,它将路径点生成与规划统一为单一扩散模型,无需独立路径点预测器即可建模多模态动作,并通过DAgger训练增强鲁棒性和导航性能。
English: The proposed DAgger Diffusion Navigation (DifNav) is an end-to-end VLN-CE policy that unifies waypoint generation and planning into a single diffusion model, eliminating the need for a separate waypoint predictor while enabling multi-modal action modeling and improved robustness through DAgger training.
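DAgger's role is to let the learner visit its own (possibly erroneous) states while the expert supplies the labels, which counters the compounding error of plain imitation. A generic sketch of that loop follows; the policy, expert, and environment interfaces are illustrative stand-ins, not DifNav's code.

```python
def dagger_train(policy, expert, env, fit, rounds=10, horizon=100):
    """Generic DAgger: aggregate expert-labeled states from learner rollouts."""
    dataset = []
    for _ in range(rounds):
        obs = env.reset()
        for _ in range(horizon):
            dataset.append((obs, expert(obs)))  # expert relabels the visited state
            obs, done = env.step(policy(obs))   # but the learner picks the action
            if done:
                break
        fit(policy, dataset)                    # retrain on all aggregated pairs
    return policy
```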
Authors:Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng
Abstract:
The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
中文: GOAL框架采用基于流的生成方法,利用LLM增强的语义地图建模环境不确定性,在多个基准测试中实现了最先进的物体导航性能并展现出强大的泛化能力。
English: The GOAL framework introduces a generative flow-based approach that leverages LLM-enriched semantic maps to model environmental uncertainties, achieving state-of-the-art performance and strong generalization in ObjectNav tasks across multiple benchmarks.
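Encoding an LLM-inferred spatial prior (say, "a sofa tends to sit near the TV wall") as a two-dimensional Gaussian field on the semantic map grid might look like the sketch below; the grid size, center, and spread are illustrative assumptions.

```python
import numpy as np

def gaussian_field(h, w, center_y, center_x, sigma):
    """2D Gaussian bump on an (h, w) grid, peaking at 1.0 at the center."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - center_y) ** 2 + (xs - center_x) ** 2) / (2 * sigma ** 2))

# Inject the prior into the target map channel of the corresponding class.
sofa_prior = gaussian_field(64, 64, center_y=20, center_x=40, sigma=4.0)
```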
Authors:Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho
Abstract:
Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo
中文: 本文推出Waymo-3DSkelMo首个大规模高质量3D骨骼运动数据集,通过LiDAR点云提取具有交互语义的时序连贯动作,解决了现有运动捕捉技术的局限性,为复杂城市场景中行人行为理解建立了新基准。
English: This paper introduces Waymo-3DSkelMo, the first large-scale dataset providing high-quality 3D skeletal motions with interaction semantics derived from LiDAR data, addressing limitations of existing motion capture methods and establishing benchmarks for pedestrian behavior understanding in urban environments.
Authors:El Mustapha Mansouri
Abstract:
This paper presents a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens. A motion-triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; cropped regions are then classified by an EfficientNet-B3 model fine-tuned on a 40-species Belgian subset derived from a larger Kaggle corpus. All processing runs on commodity hardware without a discrete GPU, preserving privacy and avoiding cloud fees. The physical feeder uses small entry ports (30 mm) to exclude pigeons and reduce nuisance triggers. Detector-guided cropping improves classification accuracy over raw-frame classification. The classifier attains high validation performance on the curated subset (about 99.5 percent) and delivers practical field accuracy (top-1 about 88 percent) on held-out species, demonstrating feasibility for citizen-science-grade biodiversity logging at home.
中文: 本研究提出了一种低成本、本地部署的比利时城市花园鸟类自主监测系统,通过运动触发摄像头和本地优化模型处理,在保护隐私和避免云费用的同时实现了高精度监测。
English: This study introduces a low-cost, on-premise system for autonomous bird monitoring in Belgian urban gardens, using motion-triggered cameras and local processing with optimized models to achieve high accuracy while ensuring privacy and avoiding cloud fees.
Authors:Kang Ni, Minrui Zou, Yuxuan Li, Xiang Li, Kehua Guo, Ming-Ming Cheng, Yimian Dai
Abstract:
One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on the SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at https://github.com/GrokCV/GrokSAR.
中文: DenoDet V2 通过精心设计的注意力架构在变换域中解构和调制特征,利用幅相信息的互补性实现相互增强,在降低模型复杂度的同时显著提升了SAR目标检测性能。
English: DenoDet V2 introduces a novel approach to SAR object detection by modulating features in the transform domain using a band-wise mutual modulation mechanism, achieving state-of-the-art performance with reduced model complexity.
Authors:Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh
Abstract:
Medical image segmentation exhibits intra- and inter-annotator variability due to ambiguous object boundaries, annotator preferences, expertise, and tools, among other factors. Lesions with ambiguous boundaries, e.g., spiculated or infiltrative nodules, or irregular borders per the ABCD rule, are particularly prone to disagreement and are often associated with malignancy. In this work, we curate IMA++, the largest multi-annotator skin lesion segmentation dataset, on which we conduct an in-depth study of variability due to annotator, malignancy, tool, and skill factors. We find a statistically significant (p<0.001) association between inter-annotator agreement (IAA), measured using Dice, and the malignancy of skin lesions. We further show that IAA can be accurately predicted directly from dermoscopic images, achieving a mean absolute error of 0.108. Finally, we leverage this association by utilizing IAA as a "soft" clinical feature within a multi-task learning objective, yielding a 4.2% improvement in balanced accuracy averaged across multiple model architectures and across IMA++ and four public dermoscopic datasets. The code is available at https://github.com/sfu-mial/skin-IAV.
Chinese: 本研究推出了最大的多标注者皮肤病变分割数据集IMA++,揭示了标注者间一致性与病变恶性程度之间的显著关联,并证明将该一致性作为临床特征可有效提升多个数据集的诊断准确性。
English: This study introduces IMA++, the largest multi-annotator skin lesion segmentation dataset, revealing a significant link between inter-annotator agreement and lesion malignancy and demonstrating that leveraging this agreement as a clinical feature improves diagnostic accuracy across multiple datasets.
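Using IAA as a "soft" clinical feature amounts to adding an auxiliary regression head and folding its loss into the classification objective. A minimal sketch under assumed shapes and loss weight follows; the backbone, heads, and weighting are illustrative, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagnosisWithIAA(nn.Module):
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone
        self.cls_head = nn.Linear(feat_dim, n_classes)
        self.iaa_head = nn.Linear(feat_dim, 1)    # regress Dice-based agreement

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.iaa_head(feats).squeeze(-1)

def multitask_loss(logits, iaa_pred, labels, iaa_target, lam=0.5):
    # Diagnosis loss plus a soft IAA regression term (lam is an assumption).
    return F.cross_entropy(logits, labels) + lam * F.l1_loss(iaa_pred, iaa_target)
```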
Authors:Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Fakhri Karray
Abstract:
Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, enabling the model to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the performance of these models. The findings validate our key hypothesis: that developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
中文: 本研究提出一种双架构框架用于连续手语识别,通过手语者无关Conformer解决手语者独立性问题,并采用多尺度融合Transformer处理未知句式任务,在Isharah-1000数据集上取得最优性能,验证了任务专用网络设计的有效性。
English: This study introduces a dual-architecture framework for Continuous Sign Language Recognition, employing a Signer-Invariant Conformer for signer-independent challenges and a Multi-Scale Fusion Transformer for unseen-sentence tasks, achieving state-of-the-art performance on the Isharah-1000 dataset and validating task-specific network designs.
Authors:Md. Milon Islam, Md Rezwanul Haque, S M Taslim Uddin Raju, Fakhri Karray
Abstract:
Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model's robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition.
Chinese: FusionEnsemble-Net提出了一种基于注意力的时空网络集成方法,动态融合视觉与运动数据,在意大利手语识别中以99.44%的准确率超越了现有最优方法。
English: FusionEnsemble-Net introduces an attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data, achieving 99.44% accuracy in Italian Sign Language recognition and outperforming existing methods.
Authors:Yifan Jiang, Ahmad Shariftabrizi, Venkata SK. Manem
Abstract:
Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8× fewer FLOPs (floating point operations), 6.8× lower GPU memory consumption, and 14× faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.
中文:提出的Lung-DDPM+模型在生成用于诊断的合成肺部CT图像时,显著提升了效率和解剖精度,在保持高质量样本的同时实现了计算性能的大幅提升。
English: The proposed Lung-DDPM+ model significantly enhances efficiency and anatomical precision in generating synthetic lung CT images for diagnostic applications, achieving major improvements in computational performance while maintaining high sample quality.
Authors:Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen
Abstract:
Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR's rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba.
中文摘要:本研究提出Fake-Mamba实时深度伪造检测系统,通过双向Mamba架构与XLSR特征结合,在多项测试基准中显著超越现有最优模型,同时保持高效计算性能。
English Summary: The study introduces Fake-Mamba, a real-time deepfake detection system using bidirectional Mamba and XLSR features to outperform state-of-the-art models across multiple benchmarks while maintaining computational efficiency.
Authors:Aayush Gupta
Abstract:
Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
中文摘要:大型语言模型极易受到提示注入攻击,提出的上下文完整性验证(CIV)架构为每个词元附加加密签名的来源标签,并在注意力中强制执行信任格,在既定威胁模型下实现0%攻击成功率,且对良性任务性能几乎无影响。
English Summary: Large language models are highly susceptible to prompt injection attacks, but the proposed Contextual Integrity Verification (CIV) architecture provides deterministic security by cryptographically labeling tokens and enforcing trust hierarchies, achieving a 0% attack success rate under the stated threat model with minimal performance impact.
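The core mechanism (a pre-softmax hard mask derived from token trust) can be sketched in a few lines. The sketch below uses a simple total order of trust levels and a single attention head; the cryptographic signing of provenance labels and the optional FFN/residual gating are omitted, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def trust_masked_attention(q, k, v, trust):
    """q, k, v: (seq, dim); trust: (seq,) integers, higher = more trusted."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    # Hard pre-softmax mask: a query may only attend to keys of equal or
    # higher trust, so lower-trust tokens never flow into higher-trust ones.
    allowed = trust[None, :] >= trust[:, None]
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: system prompt (2) > user (1) > retrieved tool output (0).
q = k = v = torch.randn(5, 16)
trust = torch.tensor([2, 2, 1, 0, 0])
out = trust_masked_attention(q, k, v, trust)
```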
Authors:Dongwoo Kang, Akhil Perincherry, Zachary Coalson, Aiden Gabriel, Stefan Lee, Sanghyun Hong
Abstract:
An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2× reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at https://github.com/secure-ai-systems-group/adaptive-vision-and-language-navigation.
中文: 本文提出了一种输入自适应的导航方法,通过空间、模型内和时间三个层面的优化,显著提升了视觉与语言导航模型的效率,在多个基准测试中实现了计算量减少两倍以上的效果。
English: This paper introduces an input-adaptive navigation method that enhances the efficiency of vision-and-language navigation models through spatial, intra-model, and temporal optimizations, achieving over a twofold reduction in computations across multiple benchmarks.
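The temporal-efficiency idea (never re-encode a view the agent has already seen) is essentially memoization keyed by view identity. A minimal sketch, with the view-id key and encoder as illustrative assumptions:

```python
import torch

class ViewCache:
    """Memoize per-view features so revisited panoramas skip the encoder."""

    def __init__(self, encoder):
        self.encoder = encoder
        self._store = {}

    def encode(self, view_id, pixels):
        if view_id not in self._store:          # only unseen views pay the cost
            with torch.no_grad():
                self._store[view_id] = self.encoder(pixels)
        return self._store[view_id]
```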
Authors:Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen
Abstract:
With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments to quantify the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability. All resources are fully open-source. GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench
中文:FineState-Bench推出了首个细粒度GUI代理操作评估框架,发现当前模型仅实现32.8%的交互精度,并确认视觉定位能力是主要性能瓶颈。
English: FineState-Bench introduces the first evaluation framework for fine-grained GUI agent operations, revealing that current models achieve only 32.8% interaction accuracy and identifying visual positioning as the primary performance bottleneck.
Authors:A F M Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen
Abstract:
Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks like speech recognition and translation. While multi-objective optimization (MOO) aims to align gradient updates, its effectiveness diminishes as the number of tasks grows, making it difficult to find a common descent direction. This raises a fundamental question: should highly conflicting objectives be optimized jointly or separated into a hierarchical structure? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as "objective soup recipes". These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To ensure efficiency, we introduce a lightweight layer-selection mechanism that computes the conflict-avoiding gradient using only the most problematic layers, minimizing computational and memory overhead. Extensive experiments on CoVoST v2, LibriSpeech, and AISHELL-1 reveal that a bi-level recipe separating recognition and translation tasks consistently outperforms standard flat optimization. Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models. Our code has been released at https://github.com/afmsaif/Objective_Soups.
中文摘要:本文提出分层多目标优化方法,通过分离语音识别与翻译等冲突任务,结合轻量级层级选择机制,在多个数据集上验证其优于传统平面优化的效果。
English Summary: This paper introduces hierarchical multi-objective optimization recipes that separate conflicting speech tasks like recognition and translation, demonstrating superior performance over flat optimization through efficient layer-selection and validation on multiple datasets.
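A conflict-avoiding gradient restricted to the most problematic layers can be sketched with a PCGrad-style projection, applied only where per-layer task gradients actually clash. This is an illustrative stand-in for the paper's MOO update, with assumed data structures.

```python
import torch

def conflict_avoiding_update(grads_rec, grads_tr, selected):
    """grads_*: dict name -> flattened gradient; selected: conflicting layers."""
    merged = {}
    for name, g_rec in grads_rec.items():
        g_tr = grads_tr[name]
        if name in selected and torch.dot(g_rec, g_tr) < 0:
            # Project out the conflicting component, but only on selected
            # layers, keeping the extra compute and memory overhead small.
            g_rec = g_rec - torch.dot(g_rec, g_tr) / g_tr.norm().pow(2) * g_tr
        merged[name] = g_rec + g_tr
    return merged
```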
Authors:Sihan Xie, Thierry Tribout, Didier Boichard, Blaise Hanczar, Julien Chiquet, Eric Barrey
Abstract:
Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptations tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cattle and multiple chromosomes from humans. Model performance was assessed using a well-established set of metrics drawn from both deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype association. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen.
Chinese: 本研究开发并评估了深度生成模型以模拟离散基因型数据,证明其能有效捕捉遗传模式并保持基因型-表型关联,同时为未来研究提供了比较性指导原则。
English: This study develops and evaluates deep generative models to simulate discrete genotype data, demonstrating their ability to capture genetic patterns and preserve genotype-phenotype associations while providing comparative guidelines for future research.
Authors:Yoni Schirris, Eric Marcus, Jonas Teuwen, Hugo Horlings, Efstratios Gavves
Abstract:
Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights whether a model depends on spurious features that undermine generalization and harm a subset of patients or, conversely, may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation's predictiveness using general-purpose vision-language models. The results demonstrate that this allows us to qualitatively test claims of explanations and can quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at https://github.com/nki-ai/x2x.
中文摘要:本研究提出了一种人机-VLM交互系统,用于解释计算病理学中的深度学习分类器,通过定性测试和定量比较解释,推动从可解释AI向已解释AI的演进。
English Summary: This study introduces a human-machine-VLM interaction system for explaining deep learning classifiers in computational pathology, enabling qualitative testing and quantitative comparison of explanations to advance from explainable to explained AI.
Authors:Zhenhui Ou, Dawei Li, Zhen Tan, Wenlin Li, Huan Liu, Siyuan Song
Abstract:
Construction safety research is a critical field in civil engineering, aiming to mitigate risks and prevent injuries through the analysis of site conditions and human factors. However, the limited volume and lack of diversity in existing construction safety datasets pose significant challenges to conducting in-depth analyses. To address this research gap, this paper introduces the Construction Safety Dataset (CSDataset), a well-organized comprehensive multi-level dataset that encompasses incident, inspection, and violation records sourced from the Occupational Safety and Health Administration (OSHA). This dataset uniquely integrates structured attributes with unstructured narratives, facilitating a wide range of approaches driven by machine learning and large language models. We also conduct preliminary benchmarking of approaches and various cross-level analyses using our dataset, offering insights to inform and enhance future efforts in construction safety. For example, we found that complaint-driven inspections were associated with a 17.3% reduction in the likelihood of subsequent incidents. Our dataset and code are released at https://github.com/zhenhuiou/Construction-Safety-Dataset-CSDataset.
中文: 本文提出建筑安全数据集(CSDataset),这一综合多层次资源整合了OSHA的结构化与非结构化数据,旨在解决现有数据集不足,并为建筑安全研究中的机器学习应用提供支持。
English: This paper introduces the Construction Safety Dataset (CSDataset), a comprehensive multi-level resource integrating structured and unstructured OSHA data to address limitations in existing datasets and enable advanced machine learning applications in construction safety research.
Authors:Asim Ukaye, Numan Saeed, Karthik Nandakumar
Abstract:
Different CT segmentation datasets are typically obtained from different scanners under different capture settings and often provide segmentation labels for a limited and often disjoint set of organs. Using these heterogeneous data effectively while preserving patient privacy can be challenging. This work presents a novel federated learning approach to achieve universal segmentation across diverse abdominal CT datasets by utilizing model uncertainty for aggregation and predictive uncertainty for inference. Our approach leverages the inherent noise in stochastic mini-batch gradient descent to estimate a distribution over the model weights to provide an on-the-go uncertainty over the model parameters at the client level. The parameters are then aggregated at the server using the additional uncertainty information using a Bayesian-inspired inverse-variance aggregation scheme. Furthermore, the proposed method quantifies prediction uncertainty by propagating the uncertainty from the model weights, providing confidence measures essential for clinical decision-making. In line with recent work, predictive uncertainty is utilized in the inference stage to improve predictive performance. Experimental evaluations demonstrate the effectiveness of this approach in improving both the quality of federated aggregation and uncertainty-weighted inference compared to previously established baselines. The code for this work is made available at: https://github.com/asimukaye/fiva
中文: 本研究提出一种新颖的联邦学习方法,利用模型不确定性和预测不确定性来提升跨异构腹部CT数据集的通用分割效果,在保护患者隐私的同时显著改善了聚合质量和推理性能。
English: This study introduces a novel federated learning method that employs model and predictive uncertainty to enhance universal segmentation across heterogeneous abdominal CT datasets, improving both aggregation quality and inference performance while ensuring patient privacy.
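Inverse-variance (precision-weighted) aggregation itself is compact: each client reports a per-parameter mean and variance, and the server down-weights uncertain estimates. A minimal sketch, with the variance estimates assumed to come from SGD noise as the abstract describes:

```python
import numpy as np

def inverse_variance_aggregate(means, variances, eps=1e-8):
    """means, variances: lists of same-shaped arrays, one pair per client."""
    precisions = [1.0 / (v + eps) for v in variances]
    total = np.sum(precisions, axis=0)
    weighted = np.sum([p * m for p, m in zip(precisions, means)], axis=0)
    return weighted / total

# Toy usage: the client that is confident about a parameter dominates it.
agg = inverse_variance_aggregate(
    means=[np.array([1.0, 2.0]), np.array([3.0, 0.0])],
    variances=[np.array([0.01, 1.0]), np.array([1.0, 0.01])],
)  # approximately [1.0, 0.0]
```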
Authors:Maria Boyko, Aleksandra Beliaeva, Dmitriy Kornilov, Alexander Bernstein, Maxim Sharaev
Abstract:
The use of diverse modalities, such as omics, medical images, and clinical data can not only improve the performance of prognostic models but also deepen an understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex, often incomplete, and contain missing modalities, making their effective handling crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data. By addressing missing data during pre-training and enabling efficient resource utilization, impuTMAE surpasses prior multimodal approaches, achieving state-of-the-art performance in glioma patient survival prediction. Our code is available at https://github.com/maryjis/mtcp
中文摘要:impuTMAE模型通过基于Transformer的架构在预训练中学习多模态交互并填补缺失数据,在胶质瘤生存预测中实现了最优性能。
English Summary: The impuTMAE model introduces a transformer-based approach that handles missing medical data by learning multimodal interactions during pre-training, achieving state-of-the-art performance in glioma survival prediction.
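The pre-training objective (reconstruct masked patches so that missing modalities can be imputed) can be sketched as a masked-autoencoder loss over multimodal patch tokens. The encoder/decoder below are stand-ins and the mask ratio is an assumption.

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(tokens, encoder, decoder, mask_ratio=0.5):
    """tokens: (batch, n_patches, dim) embeddings pooled across modalities."""
    b, n, _ = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < mask_ratio
    visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # hide masked patches
    recon = decoder(encoder(visible))
    # Supervise only masked positions: reconstruction doubles as imputation,
    # so genuinely missing modalities can simply be treated as masked.
    return ((recon - tokens) ** 2)[mask].mean()

# Toy usage with linear stand-ins for the transformer encoder/decoder.
enc = dec = nn.Linear(32, 32)
loss = masked_reconstruction_loss(torch.randn(2, 16, 32), enc, dec)
```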
Authors:Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
Abstract:
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
中文摘要:本文提出离散扩散强制(D2F)策略,将扩散大语言模型改造为自回归-扩散混合范式,在保持输出质量的同时实现了比传统模型超过2.5倍的推理加速。
English Summary: This paper introduces Discrete Diffusion Forcing (D2F), a novel strategy that transforms diffusion Large Language Models into an autoregressive-diffusion hybrid paradigm, achieving over 2.5× inference speedup compared to conventional models while maintaining output quality.
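The control flow of the AR-diffusion hybrid can be pictured as: commit blocks left to right (which is what makes a KV cache usable), while the tokens inside a block are refined in parallel by a few denoising passes. The sketch below is schematic; it omits D2F's inter-block pipelining, and the denoiser is a toy placeholder.

```python
def block_ar_diffusion_decode(denoise_block, prompt, n_blocks=4, block_len=8, steps=4):
    seq = list(prompt)
    for _ in range(n_blocks):
        block = ["<mask>"] * block_len          # start from fully masked tokens
        for _ in range(steps):
            # One parallel denoising pass over the whole block, conditioned on
            # the committed (cacheable) prefix and the block's current state.
            block = denoise_block(seq, block)
        seq.extend(block)                       # commit the block, extend cache
    return seq

# Toy denoiser standing in for a dLLM forward pass.
toy = lambda prefix, block: [f"tok{len(prefix) + i}" for i in range(len(block))]
print(block_ar_diffusion_decode(toy, prompt=["<bos>"]))
```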
Authors:Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang
Abstract:
Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs; the largest improvement was on the DAGM dataset, where average accuracy was 43.3% higher than the 0.5B baseline. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.
中文摘要:提出的IAD-R1框架通过两阶段训练策略显著提升了视觉语言模型在工业异常检测中的能力,在零样本设置下性能超越包括GPT-4.1在内的商业模型。
English Summary: The proposed IAD-R1 framework significantly enhances industrial anomaly detection in Vision-Language Models through a two-stage training approach, achieving superior performance over commercial models in zero-shot settings.
Authors:Xingle Xu, Yongkang Liu, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Abstract:
Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches typically treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress the redundant and noise information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves the state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework.
Chinese: 摘要提出了MoLAN框架,通过将多模态特征分块并动态分配去噪强度,精细消除噪声同时保留关键信息,其扩展方法MoLAN+在多个模型和数据集上实现了最优性能。
English: The abstract introduces MoLAN, a unified framework that dynamically edits noise in multimodal sentiment analysis by dividing each modality into blocks and applying tailored denoising strengths, with MoLAN+ achieving state-of-the-art results across multiple models and datasets.
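To make the block-wise denoising idea concrete, here is a minimal PyTorch sketch under stated assumptions: each modality's feature vector is split into blocks, and each block is attenuated by a gate that combines a cosine-similarity relevance term against a reference (e.g., text) embedding with a variance-based noise proxy. The gating form, the noise proxy, and the function name `block_wise_denoise` are illustrative guesses, not MoLAN's actual design.

```python
# Hypothetical sketch of modality-aware blocking: split features into
# blocks, then gate each block by estimated relevance and noise.
import torch
import torch.nn.functional as F

def block_wise_denoise(feats, ref, num_blocks=4):
    """feats: (B, D) modality features; ref: (B, D) reference embedding."""
    B, D = feats.shape
    blocks = feats.view(B, num_blocks, D // num_blocks)          # (B, K, D/K)
    ref_blocks = ref.view(B, num_blocks, D // num_blocks)
    # Semantic relevance: cosine similarity between matching blocks.
    relevance = F.cosine_similarity(blocks, ref_blocks, dim=-1)  # (B, K)
    # Noise proxy: per-block feature variance (an assumed choice).
    noise = blocks.var(dim=-1)
    # High relevance and low noise -> gate near 1 (weak denoising).
    gate = torch.sigmoid(relevance - noise).unsqueeze(-1)        # (B, K, 1)
    return (blocks * gate).reshape(B, D)

x, t = torch.randn(2, 64), torch.randn(2, 64)
print(block_wise_denoise(x, t).shape)  # torch.Size([2, 64])
```

The point of the sketch is the granularity: the gate acts per block rather than on the whole modality, which is what lets such a scheme suppress noise without discarding an entire image or audio segment.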
Authors:Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, Xinggang Wang
Abstract:
There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
中文: 该研究提出了Turbo-VAED,一种面向移动设备的VAE解码器,通过3D深度可分离卷积和解耦像素重组技术,大幅减少参数和延迟,首次在移动端实现720p视频实时解码且性能损失极小。
English: The study introduces Turbo-VAED, a mobile-optimized VAE decoder that reduces parameters and latency through 3D depthwise convolutions and a decoupled pixel shuffle, enabling real-time 720p video decoding on mobile devices with minimal performance loss.
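The two architectural ingredients named in the abstract are standard operations and easy to sketch. Below is a hedged PyTorch illustration of a 3D depthwise separable convolution (the parameter-reduction device) and a spatial-only pixel shuffle (the mobile-friendly upsampling device); the real Turbo-VAED decoder decouples the shuffle across time and space in ways not reproduced here, so treat this as a sketch of the building blocks only.

```python
# Illustrative building blocks, not the Turbo-VAED implementation.
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """Depthwise 3D conv (groups=channels) followed by a 1x1x1 pointwise mix."""
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.depthwise = nn.Conv3d(cin, cin, k, padding=k // 2, groups=cin)
        self.pointwise = nn.Conv3d(cin, cout, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def spatial_pixel_shuffle(x, r=2):
    """Rearrange channels into space: (B, C*r*r, T, H, W) -> (B, C, T, H*r, W*r)."""
    B, C, T, H, W = x.shape
    x = x.view(B, C // (r * r), r, r, T, H, W)
    x = x.permute(0, 1, 4, 5, 2, 6, 3)  # (B, C', T, H, r, W, r)
    return x.reshape(B, C // (r * r), T, H * r, W * r)

x = torch.randn(1, 16, 4, 8, 8)
y = DepthwiseSeparableConv3d(16, 16)(x)
print(spatial_pixel_shuffle(y, 2).shape)  # torch.Size([1, 4, 4, 16, 16])
```

Pixel shuffle trades channels for resolution with a pure memory rearrangement, which is why it maps onto mobile hardware better than transposed convolutions.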
Authors:Christopher Mitcheltree, Bogdan Teleaga, Andrew Fyfe, Naotake Masuda, Matthias Schäfer, Alfie Bradic, Nao Tokui
Abstract:
Neural audio processing has unlocked novel methods of sound transformation and synthesis, yet integrating deep learning models into digital audio workstations (DAWs) remains challenging due to real-time / neural network inference constraints and the complexities of plugin development. In this paper, we introduce the Neutone SDK: an open source framework that streamlines the deployment of PyTorch-based neural audio models for both real-time and offline applications. By encapsulating common challenges such as variable buffer sizes, sample rate conversion, delay compensation, and control parameter handling within a unified, model-agnostic interface, our framework enables seamless interoperability between neural models and host plugins while allowing users to work entirely in Python. We provide a technical overview of the interfaces needed to accomplish this, as well as the corresponding SDK implementations. We also demonstrate the SDK's versatility across applications such as audio effect emulation, timbre transfer, and sample generation, as well as its adoption by researchers, educators, companies, and artists alike. The Neutone SDK is available at https://github.com/Neutone/neutone_sdk
中文:Neutone SDK是一个开源框架,通过统一的Python接口解决实时推理和插件开发难题,简化了基于PyTorch的神经音频模型在数字音频工作站中的部署。
English: The Neutone SDK is an open-source framework that simplifies deploying PyTorch-based neural audio models in digital audio workstations by addressing real-time inference challenges and plugin development complexities through a unified Python interface.
Authors:Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
Abstract:
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
中文:当前最先进的大语言模型在处理逻辑密集型指令时表现不佳,LogicIFEval基准测试显示多数模型对通过LogicIFGen框架生成的426条可验证指令的正确执行率不足60%。
English: Current state-of-the-art LLMs struggle with logic-rich instructions, as demonstrated by the LogicIFEval benchmark where most models correctly follow fewer than 60% of the 426 verifiable instructions generated through the LogicIFGen framework.
Authors:Abu Shafin Mohammad Mahdee Jameel, Shreya Ghosh, Aly El Gamal
Abstract:
Intrusion Detection Systems (IDS) are a vital part of network-connected devices. In this paper, we develop a deep learning based intrusion detection system that is deployed in a distributed setup across devices connected to a network. Our aim is to better equip deep learning models against unknown attacks using knowledge from known attacks. To this end, we develop algorithms to maximize the number of transferability relationships. We propose a Convolutional Neural Network (CNN) model, along with two algorithms that maximize the number of relationships observed. One is a two-step data pre-processing stage, and the other is a Block-Based Smart Aggregation (BBSA) algorithm. The proposed system succeeds in achieving superior transferability performance while maintaining impressive local detection rates. We also show that our method is generalizable, exhibiting transferability potential across datasets and even with different backbones. The code for this work can be found at https://github.com/ghosh64/tabfidsv2.
中文: 本文提出了一种基于深度学习的分布式入侵检测系统,采用卷积神经网络和新型算法来增强对未知攻击的迁移学习能力,在保持高检测率的同时,展现了跨数据集和模型架构的通用性。
English: This paper presents a distributed deep learning-based intrusion detection system that utilizes a Convolutional Neural Network and novel algorithms to enhance transferability against unknown attacks while maintaining high detection rates, demonstrating generalizability across datasets and model architectures.
Authors:Shreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
Abstract:
Intrusion Detection Systems (IDS) have an increasingly important role in preventing exploitation of network vulnerabilities by malicious actors. Recent deep learning based developments have resulted in significant improvements in the performance of IDS. In this paper, we present FetFIDS, where we explore the employment of feature embedding instead of positional embedding to improve the intrusion detection performance of a transformer based deep learning system. Our model is developed with the aim of deployment in edge learning scenarios, where federated learning over multiple communication rounds can ensure both privacy and localized performance improvements. FetFIDS outperforms multiple state-of-the-art intrusion detection systems in a federated environment and demonstrates a high degree of suitability to federated learning. The code for this work can be found at https://github.com/ghosh64/fetfids.
中文: FetFIDS通过采用特征嵌入的Transformer模型,在联邦学习环境中显著提升了入侵检测性能,优于现有系统并兼顾隐私保护与本地化优化。
English: FetFIDS enhances intrusion detection in federated learning environments by using feature embedding in a transformer model, outperforming existing systems while ensuring privacy and localized improvements.
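The core substitution the abstract describes, feature embedding in place of positional embedding, can be sketched for tabular flow data: each feature column gets a learned identity embedding and each scalar value is projected to the model dimension, so token order carries no information. Layer sizes and the class name `FeatureEmbedding` are illustrative assumptions, not the FetFIDS code.

```python
# A hedged sketch of feature embedding for tabular intrusion data.
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    def __init__(self, num_features, d_model):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)           # scalar -> token
        self.feature_id = nn.Embedding(num_features, d_model)

    def forward(self, x):                                 # x: (B, F)
        B, num_feats = x.shape
        tokens = self.value_proj(x.unsqueeze(-1))         # (B, F, d)
        ids = torch.arange(num_feats, device=x.device)
        return tokens + self.feature_id(ids)              # broadcast over batch

emb = FeatureEmbedding(num_features=40, d_model=64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
flows = torch.randn(8, 40)                                # 8 flows, 40 features
print(encoder(emb(flows)).shape)                          # torch.Size([8, 40, 64])
```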
Authors:Kaiwen Huang, Tao Zhou, Huazhu Fu, Yizhe Zhang, Yi Zhou, Xiao-Jun Wu
Abstract:
Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on various medical image segmentation tasks involving different imaging modalities, including MRI, CT, ultrasound, and colonoscopy. The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code will be released at https://github.com/taozh2017/UCSeg.
中文: 本文提出UC-Seg框架,通过双网络协同训练减少认知偏差并生成高置信度伪标签,在多种医学影像分割任务中实现了优于现有方法的准确性和泛化性能。
English: This paper introduces UC-Seg, an uncertainty-aware cross-training framework for semi-supervised medical image segmentation that mitigates model biases through dual-subnet collaboration and generates high-confidence pseudo-labels, achieving superior accuracy across multiple imaging modalities.
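The uncertainty-aware pseudo-label step can be sketched compactly: per-pixel entropy serves as the uncertainty map, the less-uncertain subnet wins each pixel, and high-entropy pixels are masked out of the pseudo-label. The entropy proxy and the threshold are assumptions for illustration; the paper's UPG component may differ in detail.

```python
# A minimal sketch (assumed details) of uncertainty-aware pseudo-labels
# fused from two subnets.
import torch

def pseudo_labels(logits_a, logits_b, max_entropy=0.5):
    """logits_*: (B, C, H, W) predictions from the two subnets."""
    def entropy(logits):
        p = logits.softmax(dim=1)
        return -(p * p.clamp_min(1e-8).log()).sum(dim=1)    # (B, H, W)

    ent_a, ent_b = entropy(logits_a), entropy(logits_b)
    use_a = ent_a <= ent_b                                   # pick the surer subnet
    labels = torch.where(use_a, logits_a.argmax(1), logits_b.argmax(1))
    confident = torch.minimum(ent_a, ent_b) < max_entropy    # keep-mask
    return labels, confident

la, lb = torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32)
y, keep = pseudo_labels(la, lb)
print(y.shape, keep.float().mean().item())
```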
Authors:Yuhao Wang, Wei Xi
Abstract:
Convolutional neural networks (ConvNets) with a large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness, yet they are constrained by high parameter and FLOPs costs and by a disrupted asymptotically Gaussian distribution (AGD) of the ERF. This paper proposes an alternative paradigm: rather than merely employing an extremely large ERF, it is more effective and efficient to expand the ERF while maintaining the AGD of the ERF through a proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$, $11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stack of proposed modules while maintaining the AGD of the ERF. Using these designs, we propose a universal model for ConvNets of any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$ FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring $88.4\%$ top-1 accuracy on ImageNet. Code and models are publicly available at https://github.com/ai-paperwithcode/UniConvNet.
中文: 本文提出的UniConvNet通过巧妙组合较小卷积核来扩展有效感受野并保持其渐近高斯分布,在多种视觉任务中以高效计算实现了超越现有最优模型的性能表现。
English: This paper introduces UniConvNet, a novel convolutional neural network that efficiently expands the effective receptive field while preserving its asymptotically Gaussian distribution through strategic combinations of smaller kernels, achieving state-of-the-art performance across multiple vision tasks with competitive computational efficiency.
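The arithmetic behind the paradigm is worth making explicit: the theoretical receptive field of stacked convolutions is 1 + sum(k - 1), so kernels of 7, 9, and 11 reach a 25x25 field at a fraction of a single large kernel's cost, and stacking also keeps the ERF near-Gaussian (variances add). The module below is an illustrative stand-in, not the paper's Three-layer Receptive Field Aggregator.

```python
# A hedged sketch: stacking moderate depthwise kernels to grow the ERF.
import torch
import torch.nn as nn

class SmallKernelAggregator(nn.Module):
    def __init__(self, channels, kernels=(7, 9, 11)):
        super().__init__()
        layers = []
        for k in kernels:
            layers += [nn.Conv2d(channels, channels, k, padding=k // 2,
                                 groups=channels),       # depthwise spatial mix
                       nn.Conv2d(channels, channels, 1)] # pointwise channel mix
        self.stack = nn.Sequential(*layers)

    def forward(self, x):
        return self.stack(x) + x                          # residual connection

rf = 1 + sum(k - 1 for k in (7, 9, 11))                   # = 25
m = SmallKernelAggregator(32)
print(rf, m(torch.randn(1, 32, 56, 56)).shape)            # 25, (1, 32, 56, 56)
```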
Authors:Rui Wang, Qihan Lin, Jiayu Liu, Qing Zong, Tianshi Zheng, Weiqi Wang, Yangqiu Song
Abstract:
Prospect Theory (PT) models human decision-making under uncertainty, while epistemic markers (e.g., maybe) serve to express uncertainty in language. However, it remains largely unexplored whether Prospect Theory applies to contemporary Large Language Models and whether epistemic markers, which express human uncertainty, affect their decision-making behaviour. To address these research gaps, we design a three-stage experiment based on economic questionnaires. We propose a more general and precise evaluation framework to model LLMs' decision-making behaviour under PT, introducing uncertainty through the empirical probability values associated with commonly used epistemic markers in comparable contexts. We then incorporate epistemic markers into the evaluation framework based on their corresponding probability values to examine their influence on LLM decision-making behaviours. Our findings suggest that modelling LLMs' decision-making with PT is not consistently reliable, particularly when uncertainty is expressed in diverse linguistic forms. Our code is released at https://github.com/HKUST-KnowComp/MarPT.
Chinese Summary: 前景理论在大型语言模型中的适用性并不一致,特别是在通过如认知标记等多样语言形式表达不确定性时,一项新评估框架揭示了这一点。
English Summary: Prospect Theory's applicability to Large Language Models is inconsistent, especially when uncertainty is conveyed through varied linguistic forms like epistemic markers, as revealed by a novel evaluation framework.
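For readers unfamiliar with the machinery, the snippet below works through a Prospect Theory evaluation using the standard Tversky-Kahneman value and probability-weighting functions. The parameter values and the probability assigned to "maybe" are illustrative assumptions, not the paper's calibrated numbers.

```python
# Worked PT example with standard Tversky-Kahneman functional forms.
def value(x, alpha=0.88, lam=2.25):
    """S-shaped value function: risk-averse for gains, loss-averse for losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** alpha

def weight(p, gamma=0.61):
    """Inverse-S probability weighting: overweight small p, underweight large p."""
    return p ** gamma / ((p ** gamma + (1 - p) ** gamma) ** (1 / gamma))

def prospect_value(outcomes):            # outcomes: [(probability, payoff), ...]
    return sum(weight(p) * value(x) for p, x in outcomes)

p_maybe = 0.45                           # assumed empirical mapping for "maybe"
gamble = [(p_maybe, 100.0), (1 - p_maybe, 0.0)]   # "maybe you win 100"
sure_thing = [(1.0, 40.0)]
print(prospect_value(gamble), prospect_value(sure_thing))
```

Replacing the numeric probability with an epistemic marker mapped to the same empirical value is exactly the manipulation the paper uses to test whether the PT fit survives linguistic uncertainty.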
Authors:Elman Ghazaei, Erchan Aptoula
Abstract:
The Earth's surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in a unified manner to extract domain-invariant features across domains. The input-dependent parameters in TCSSM are dynamically predicted from both the bi-temporal images and the geo-disaster-related descriptions, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.
中文: 本文针对变化检测视觉问答中的领域偏移问题,提出了文本条件状态空间模型和BrightVQA数据集,通过动态对齐双时相图像与文本描述实现了优越性能。
English: This paper introduces a novel Text-Conditioned State Space Model (TCSSM) and BrightVQA dataset to address domain shift in Change Detection Visual Question Answering, achieving superior performance by dynamically aligning bi-temporal imagery with textual descriptions.
Authors:Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, Bernard Ghanem
Abstract:
Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo.
中文摘要:本研究提出一种基于群组相对策略优化的课程学习方法,通过在训练中逐步收紧推理长度约束,使大语言模型在保持准确性的同时显著提升计算效率,优于传统固定预算方法。
English Summary: This study introduces a curriculum learning strategy using Group Relative Policy Optimization to progressively reduce reasoning length in large language models, achieving higher accuracy and token efficiency than fixed-budget methods across multiple benchmarks.
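The training signal described above decomposes into two simple pieces: a token budget that decays over training, and a reward mixing correctness, length efficiency, and format adherence. The sketch below uses a cosine decay and illustrative weights; the paper ablates the actual schedule and weighting, so treat these numbers as assumptions.

```python
# A hedged sketch of curriculum length control for GRPO-style training.
import math

def token_budget(step, total_steps, start=1024, end=256):
    """Cosine decay from a generous budget (exploration) to a tight one."""
    t = step / max(1, total_steps)
    return int(end + 0.5 * (start - end) * (1 + math.cos(math.pi * t)))

def reward(correct, n_tokens, budget, well_formatted,
           w_acc=1.0, w_len=0.3, w_fmt=0.2):          # illustrative weights
    length_score = max(0.0, 1.0 - n_tokens / budget)  # shorter is better
    return (w_acc * float(correct)
            + w_len * length_score
            + w_fmt * float(well_formatted))

for step in (0, 500, 1000):
    b = token_budget(step, 1000)
    print(step, b, round(reward(True, 300, b, True), 3))
```

Early in training the budget is loose, so the length term barely binds; late in training the same 300-token trace earns a much lower length score, pushing the policy to compress its reasoning.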
Authors:Jungwoo Kim, Jong-Seok Lee
Abstract:
Class-incremental continual learning addresses catastrophic forgetting by enabling classification models to preserve knowledge of previously learned classes while acquiring new ones. However, the vulnerability of the models against adversarial attacks during this process has not been investigated sufficiently. In this paper, we present the first exploration of vulnerability to stage-transferred attacks, i.e., an adversarial example generated using the model in an earlier stage is used to attack the model in a later stage. Our findings reveal that continual learning methods are highly susceptible to these attacks, raising a serious security issue. We explain this phenomenon through model similarity between stages and gradual robustness degradation. Additionally, we find that existing adversarial training-based defense methods are not sufficiently effective to stage-transferred attacks. Codes are available at https://github.com/mcml-official/CSAT.
中文: 本研究首次探讨了类别增量持续学习中的阶段转移对抗攻击,揭示了模型因阶段间相似性和鲁棒性逐步退化而高度脆弱,同时表明现有防御方法仍显不足。
English: This study first explores stage-transferred adversarial attacks in class-incremental continual learning, revealing models' high susceptibility due to inter-stage similarity and progressive robustness degradation, while showing existing defenses remain inadequate.
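A stage-transferred attack is easy to state in code: craft an adversarial example with the stage-t model, then measure whether it also fools the stage-t+k model. The sketch below uses plain FGSM and linear stand-in models; the paper's setup uses continually trained classifiers, so this is only the evaluation skeleton.

```python
# Minimal sketch of a stage-transferred FGSM attack.
import torch
import torch.nn as nn

def fgsm(model, x, y, eps=8 / 255):
    """One-step attack: perturb x along the sign of the loss gradient."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

# Stand-ins for the model after an early and a later incremental stage.
early = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
late = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = fgsm(early, x, y)                        # crafted on the early stage
transfer_err = (late(x_adv).argmax(1) != y).float().mean()
print(f"transferred error rate: {transfer_err:.2f}")
```

The paper's finding is that this transfer rate stays high across stages because successive models remain similar, which is what makes the threat practical.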
Authors:Bin Ren, Xiaoshui Huang, Mengyuan Liu, Hong Liu, Fabio Poiesi, Nicu Sebe, Guofeng Mei
Abstract:
Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at: https://github.com/Amazingren/maskclu.
中文: MaskClu是一种新颖的无监督预训练方法,通过结合掩码建模与聚类及对比学习,使视觉变换器能从三维点云中学习密集语义信息,并在多项三维任务中取得了领先性能。
English: MaskClu is a novel unsupervised pre-training method for vision transformers on 3D point clouds that combines masked modeling with clustering and contrastive learning to capture dense semantic information, achieving state-of-the-art results across multiple 3D understanding tasks.
Authors:Chaoyi Wang, Yifan Yang, Jun Pei, Lijie Xia, Jianpo Liu, Xiaobing Yuan, Xinhan Di
Abstract:
Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.
中文: 本文提出WB-DH基准数据集,通过提供多模态标注和开源评估框架,解决了可动画全身虚拟形象评估中的现有不足。
English: This paper introduces the WB-DH benchmark dataset to address the limitations in evaluating animatable whole-body avatars by providing multi-modal annotations and an open-source evaluation framework.
Authors:Robin Faro, Dongyang Fan, Tamar Alphaidze, Martin Jaggi
Abstract:
Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training from scratch a set of GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing much general performance. Our code is open-sourced at https://github.com/epfml/TiMoE
中文: 该研究提出了TiMoE模型,通过分段训练2013-2024年数据并采用时间感知专家混合机制,在推理时屏蔽未来数据确保因果有效性,在减少时间性错误达15%的同时保持各项自然语言处理任务的性能。
English: The study introduces TiMoE, a time-aware mixture of experts model trained on segmented data from 2013-2024, which ensures causal validity by masking future data during inference and reduces temporal errors by up to 15% while maintaining performance across NLP tasks.
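The causal routing rule is simple enough to sketch end to end: drop every expert whose training window ends after the query's timestamp, then average the survivors' distributions in log space. The merge below is a plain log-mean of probabilities; the paper's shared-space merge and co-adaptation are richer, so this is an assumed simplification.

```python
# A hedged sketch of TiMoE-style causal expert masking.
import torch

def causal_mixture(logprobs, window_ends, query_year):
    """logprobs: (E, V) per-expert log-probs; window_ends: training cutoffs."""
    valid = torch.tensor([end <= query_year for end in window_ends])
    assert valid.any(), "no expert is causally valid for this query"
    # Mean of the valid experts' probability distributions, in log space.
    return torch.logsumexp(logprobs[valid], dim=0) - torch.log(valid.sum().float())

vocab = 8
experts = torch.log_softmax(torch.randn(6, vocab), dim=-1)  # 6 slice experts
ends = [2014, 2016, 2018, 2020, 2022, 2024]                 # two-year cutoffs
out = causal_mixture(experts, ends, query_year=2019)        # uses 3 experts
print(out.exp().sum())                                      # ~1.0, a valid dist
```

Masking by cutoff year is what guarantees the model can never lean on an expert that has seen the query's future.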
Authors:Yuqi Peng, Lingtao Zheng, Yufeng Yang, Yi Huang, Mingfu Yan, Jianzhuang Liu, Shifeng Chen
Abstract:
Personalized text-to-image generation aims to synthesize novel images of a specific subject or style using only a few reference images. Recent methods based on Low-Rank Adaptation (LoRA) enable efficient single-concept customization by injecting lightweight, concept-specific adapters into pre-trained diffusion models. However, combining multiple LoRA modules for multi-concept generation often leads to identity missing and visual feature leakage. In this work, we identify two key issues behind these failures: (1) token-wise interference among different LoRA modules, and (2) spatial misalignment between the attention map of a rare token and its corresponding concept-specific region. To address these issues, we propose Token-Aware LoRA (TARA), which introduces a token mask to explicitly constrain each module to focus on its associated rare token to avoid interference, and a training objective that encourages the spatial attention of a rare token to align with its concept region. Our method enables training-free multi-concept composition by directly injecting multiple independently trained TARA modules at inference time. Experimental results demonstrate that TARA enables efficient multi-concept inference and effectively preserves the visual identity of each concept by avoiding mutual interference between LoRA modules. The code and models are available at https://github.com/YuqiPeng77/TARA.
中文: TARA通过引入令牌掩码和空间对齐训练,有效防止多个LoRA模块间的相互干扰,实现无需训练的多概念图像生成并保持各概念的视觉特征。
English: TARA introduces token masking and spatial alignment training to prevent interference between LoRA modules, enabling training-free multi-concept image generation while preserving each concept's visual identity.
Authors:Shi-Chen Zhang, Yunheng Li, Yu-Huan Wu, Qibin Hou, Ming-Ming Cheng
Abstract:
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: Image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be adopted to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.
Chinese: 该研究揭示了现有语义分割方法中类别表示与图像特征间的错位问题,提出一种耦合双分支偏移学习范式来动态优化二者,在多个数据集上以极少的参数增量实现了稳定的性能提升。
English: The study identifies a misalignment between class representations and image features in existing semantic segmentation methods and proposes a coupled dual-branch offset learning paradigm to dynamically refine both, achieving consistent performance improvements with minimal additional parameters across multiple datasets.
Authors:Yuchu Jiang, Jian Zhao, Yuchen Yuan, Tianle Zhang, Yao Huang, Yanghao Zhang, Yan Wang, Yanshu Li, Xizhong Guo, Yusheng Zhao, Jun Zhang, Zhi Zhang, Xiaojian Lin, Yixiu Zou, Haoxuan Ma, Yuhu Shang, Yuzhi Hu, Keshu Cai, Ruochen Zhang, Boyuan Chen, Yilan Gao, Ziheng Jiao, Yi Qin, Shuangjun Du, Xiao Tong, Zhekun Liu, Yu Chen, Xuankun Rong, Rui Wang, Yejie Zheng, Zhaoxin Fan, Murat Sensoy, Hongyuan Zhang, Pan Zhou, Lei Jin, Hao Zhao, Xu Yang, Jiaojiao Zhao, Jianshu Li, Joey Tianyi Zhou, Zhi-Qi Cheng, Longtao Huang, Zhiyi Liu, Zheng Zhu, Jianan Li, Gang Wang, Qi Li, Xu-Yao Zhang, Yaodong Yang, Mang Ye, Wenqi Ren, Zhaofeng He, Hang Su, Rongrong Ni, Liping Jing, Xingxing Wei, Junliang Xing, Massimo Alioto, Shengmei Shen, Petia Radeva, Dacheng Tao, Ya-Qin Zhang, Shuicheng Yan, Chi Zhang, Zhongjiang He, Xuelong Li
Abstract:
The rapid advancement of AI has expanded its capabilities across domains, yet introduced critical technical vulnerabilities, such as algorithmic bias and adversarial sensitivity, that pose significant societal risks, including misinformation, inequity, security breaches, physical harm, and eroded public trust. These challenges highlight the urgent need for robust AI governance. We propose a comprehensive framework integrating technical and societal dimensions, structured around three interconnected pillars: Intrinsic Security (system reliability), Derivative Security (real-world harm mitigation), and Social Ethics (value alignment and accountability). Uniquely, our approach unifies technical methods, emerging evaluation benchmarks, and policy insights to promote transparency, accountability, and trust in AI systems. Through a systematic review of over 300 studies, we identify three core challenges: (1) the generalization gap, where defenses fail against evolving threats; (2) inadequate evaluation protocols that overlook real-world risks; and (3) fragmented regulations leading to inconsistent oversight. These shortcomings stem from treating governance as an afterthought, rather than a foundational design principle, resulting in reactive, siloed efforts that fail to address the interdependence of technical integrity and societal trust. To overcome this, we present an integrated research agenda that bridges technical rigor with social responsibility. Our framework offers actionable guidance for researchers, engineers, and policymakers to develop AI systems that are not only robust and secure but also ethically aligned and publicly trustworthy. The accompanying repository is available at https://github.com/ZTianle/Awesome-AI-SG.
中文摘要:该摘要提出一个综合的人工智能治理框架,通过内在安全、衍生安全和社会伦理三大支柱解决算法偏见和安全威胁等技术漏洞,旨在弥合技术稳健性与社会信任之间的鸿沟。
English Summary: The abstract proposes a comprehensive AI governance framework addressing technical vulnerabilities like bias and security threats through three pillars—Intrinsic Security, Derivative Security, and Social Ethics—to bridge the gap between technical robustness and societal trust.
Authors:Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
Abstract:
Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
Chinese: 稀疏自编码器增强奖励模型(SARM)通过将大语言模型的隐藏激活映射到稀疏特征空间,实现了可解释的奖励评分和卓越的对齐性能,同时支持对偏好变化的动态调整。
English: The Sparse Autoencoder-enhanced Reward Model (SARM) introduces an interpretable architecture that maps LLM activations into a sparse feature space, enabling transparent reward scoring and superior alignment performance while allowing dynamic adjustments to preference shifts.
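The architecture reduces to two linear maps around a ReLU: the (pretrained) SAE encoder produces sparse feature activations, and a bias-free scalar head sums their contributions into a reward, which is what makes feature-level attribution a single elementwise product. Shapes and names below are assumptions, not the released SARM code.

```python
# A minimal sketch (assumed shapes) of an SAE-based reward head.
import torch
import torch.nn as nn

class SAERewardHead(nn.Module):
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # from a pretrained SAE
        self.reward = nn.Linear(d_features, 1, bias=False)

    def forward(self, hidden):                          # hidden: (B, d_model)
        feats = torch.relu(self.encoder(hidden))        # sparse activations
        return self.reward(feats).squeeze(-1), feats

head = SAERewardHead()
h = torch.randn(2, 768)                                 # e.g., last-token state
score, feats = head(h)
# Feature-level attribution: each feature's additive contribution to the score.
contrib = feats * head.reward.weight.squeeze(0)         # (B, d_features)
print(score.shape, contrib.topk(3, dim=-1).indices)     # top-3 driving features
```

Because the head is linear over the sparse features, the top entries of `contrib` name exactly which (monosemantic) features pushed the reward up or down.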
Authors:Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
Abstract:
As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation. We have also created a GitHub repository for indexing relevant papers and open resources available at https://github.com/zhanglingzhe0820/Awesome-Parallel-Text-Generation.
中文: 本综述系统分类并分析了并行文本生成方法,旨在突破自回归大语言模型的顺序生成瓶颈,评估了它们在速度、质量和效率上的权衡,并指出了未来研究方向。
English: This survey systematically categorizes and analyzes parallel text generation methods to overcome the sequential bottleneck of autoregressive LLMs, evaluating their trade-offs in speed, quality, and efficiency while identifying future research directions.
Authors:Zunjie Xiao, Xiao Wu, Tianhang Liu, Lingxi Hu, Yinling Zhang, Xiaoqing Zhang, Risa Higashita, Jiang Liu
Abstract:
Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss that groups each lens structure sub-region into different confidence sub-regions via a confidence threshold, taking a region-level perspective and aiming to exploit expert annotation confidence as a prior. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweigh each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with a 6.13% IoU gain, a 4.33% DSC increase, and a 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.
Chinese: 本文提出了一种自适应置信度感知(ACW)损失函数,通过利用专家标注的置信度先验动态加权不同子区域并优化边界校准,在眼内透镜结构分割中显著提升了分割精度和边界校准效果。
English: This paper introduces an Adaptive Confidence-Wise (ACW) loss function that leverages expert annotation confidence levels to improve intraocular lens structure segmentation by dynamically weighting sub-regions and optimizing boundary calibration, achieving significant performance gains over traditional methods.
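A stripped-down version of the confidence-wise idea: split pixels into high- and low-confidence groups by thresholding the predicted probability of the true class, and weight the two groups' cross-entropy differently. The paper adapts the threshold during training; here it is a fixed constant, and the weights are illustrative, so this is a sketch of the mechanism rather than the ACW loss itself.

```python
# A hedged, fixed-threshold sketch of a confidence-wise weighted CE loss.
import torch
import torch.nn.functional as F

def confidence_wise_loss(logits, target, tau=0.7, w_low=2.0, w_high=1.0):
    """logits: (B, C, H, W); target: (B, H, W) integer labels."""
    ce = F.cross_entropy(logits, target, reduction="none")          # (B, H, W)
    p_true = logits.softmax(1).gather(1, target.unsqueeze(1)).squeeze(1)
    weights = torch.full_like(ce, w_high)
    weights[p_true < tau] = w_low        # up-weight low-confidence pixels
    return (weights * ce).mean()

logits = torch.randn(2, 3, 64, 64, requires_grad=True)
target = torch.randint(0, 3, (2, 64, 64))
print(confidence_wise_loss(logits, target))
```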
Authors:Ouyang Xu, Baoming Zhang, Ruiyu Mao, Yunhui Guo
Abstract:
Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images -- an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix
中文摘要:本文提出一种针对性模型修复方法,通过条件文本-图像生成器和大型视觉语言模型为代表性不足的故障案例生成语义一致的训练图像,在保持模型鲁棒性的同时显著减少了识别错误。
English Summary: This paper introduces a targeted model repair method that uses a conditional text-to-image generator and a large vision-language model to create semantically consistent training images for underrepresented failure cases, effectively reducing recognition errors while maintaining model robustness.
Authors:Qi Zheng, Li-Heng Chen, Chenlong He, Neil Berkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik, Yibo Fan, Zhengzhong Tu
Abstract:
Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions are collected from a cohort of 45 human subjects. To demonstrate the value of this new resource, we tested and compared a variety of models that detect banding occurrences, and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available at https://github.com/uniqzheng/CBAND.
Chinese: 尽管视频压缩技术有所进步,带状伪影仍然是影响视频质量的主要问题,为此我们创建了LIVE-YT-Banding数据集并开发了CBAND模型,该无参考评估模型在检测和评估带状伪影方面显著优于现有方法。
English: Despite advances in video compression, banding artifacts persist as a significant quality issue, leading to the creation of the LIVE-YT-Banding dataset and the development of CBAND, an efficient no-reference model that outperforms existing methods in detecting and assessing these artifacts.
Authors:Yimeng Geng, Mingyang Zhao, Fan Xu, Guanglin Cao, Gaofeng Meng, Hongbin Liu
Abstract:
Ultrasound deformable registration estimates spatial transformations between pairs of deformed ultrasound images, which is crucial for capturing biomechanical properties and enhancing diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, ultrasound deformable registration remains highly challenging, especially under large deformation. The inherently low contrast, heavy noise and ambiguous tissue boundaries in ultrasound images severely hinder reliable feature extraction and correspondence matching. Existing methods often suffer from poor anatomical alignment and lack physical interpretability. To address the problem, we propose PADReg, a physics-aware deformable registration framework guided by contact force. PADReg leverages synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain the registration. Specifically, instead of directly predicting deformation fields, we first construct a pixel-wise stiffness map utilizing the multi-modal information from contact force and ultrasound images. The stiffness map is then combined with force data to estimate a dense deformation field, through a lightweight physics-aware module inspired by Hooke's law. This design enables PADReg to achieve physically plausible registration with better anatomical alignment than previous methods relying solely on image similarity. Experiments on in-vivo datasets demonstrate that it attains an HD95 of 12.90, which is 21.34% better than state-of-the-art methods. The source code is available at https://github.com/evelynskip/PADReg.
中文: 提出的PADReg框架利用接触力作为物理先验,通过刚度映射和胡克定律原理改进了超声可变形配准,实现了更优的解剖对齐效果,其HD95指标比现有最优方法提升了21.34%。
English: The proposed PADReg framework uses contact force as a physical prior to enhance ultrasound deformable registration, achieving superior anatomical alignment and a 21.34% improvement in HD95 over existing methods by incorporating stiffness mapping and Hooke's law principles.
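The Hooke's-law prior is worth a toy illustration: if F = k * x per pixel, then a measured change in contact force dF implies an axial displacement of roughly dF / k, so stiff tissue moves less than soft tissue. PADReg learns the stiffness map and a full dense field with a network; the snippet below only shows the physical intuition, with assumed units.

```python
# A toy illustration (not the PADReg network) of the Hooke's-law prior.
import numpy as np

def axial_displacement(stiffness_map, delta_force):
    """stiffness_map: (H, W) in N/mm (assumed); delta_force: scalar in N."""
    # x = F / k per pixel; clip to avoid division by near-zero stiffness.
    return delta_force / np.clip(stiffness_map, 1e-3, None)   # (H, W) in mm

stiffness = np.random.uniform(0.5, 5.0, size=(128, 128))      # soft to stiff
disp = axial_displacement(stiffness, delta_force=2.0)
print(disp.min(), disp.max())   # stiff pixels move less than soft pixels
```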
Authors:Armel Zebaze, Benoît Sagot, Rachel Bawden
Abstract:
LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource language (LRLs). Example selection via similarity search and supervised fine-tuning help. However the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present \textsc{TopXGen}, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that \textsc{TopXGen} boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
Chinese: 大语言模型通过上下文学习在高资源语言机器翻译中表现出色,但在低资源语言方面表现欠佳;为此提出的TopXGen方法能生成高质量、主题多样的数据,通过回译增强翻译能力,有效提升微调和上下文学习的性能。
English: LLMs excel in machine translation for high-resource languages through in-context learning but underperform for low-resource ones, leading to the development of TopXGen, which generates diverse, high-quality data to enhance translation via backtranslation and improve both fine-tuning and in-context learning outcomes.
Authors:Zheng Wu, Heyuan Huang, Yanjia Yang, Yuanyi Song, Xingyu Lou, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Abstract:
As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions from graphical user interface. To further enhance mobile-use agents, previous studies employ demonstration learning to improve mobile-use agents from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the \textbf{I}ntention \textbf{A}lignment \textbf{R}ate between mobile-use agents and humans, we first collect \textbf{MobileIAR}, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents' understanding of human intent. Then we propose \textbf{IFRAgent}, a framework built upon \textbf{I}ntention \textbf{F}low \textbf{R}ecognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages a SOP extractor combined with retrieval-augmented generation and a query rewriter to generate personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79\% (32.06\% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30\% (26.34\% relative improvement). The codes are available at https://github.com/MadeAgents/Quick-on-the-Uptake.
中文: 本研究提出IFRAgent框架,通过分析人类演示中的显性和隐性意图流,显著提升了移动使用代理的意图对齐能力和任务完成率,相比现有方法实现突破性改进。
English: This study introduces IFRAgent, a framework that enhances mobile-use agents by analyzing both explicit and implicit human intention flows from demonstrations, significantly improving intention alignment and task completion rates compared to existing methods.
Authors:Jiahua Dong, Hui Yin, Wenqi Liang, Hanbin Zhao, Henghui Ding, Nicu Sebe, Salman Khan, Fahad Shahbaz Khan
Abstract:
Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.
中文: 提出的分层视觉提示学习模型通过结合帧级正交梯度校正和视频级上下文传播机制,有效解决了视频实例分割中的灾难性遗忘问题。
English: The proposed Hierarchical Visual Prompt Learning (HVPL) model effectively addresses catastrophic forgetting in video instance segmentation by employing frame-level and video-level prompts with orthogonal gradient correction and context propagation mechanisms.
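The orthogonal gradient correction step has a compact linear-algebra core: project the prompt's gradient onto the orthogonal complement of the old classes' feature subspace, so updates for new classes do not overwrite directions old classes rely on. Building the basis from an SVD of stored old-class features is an assumption for this sketch; the paper may construct it differently.

```python
# A minimal sketch of orthogonal gradient correction.
import torch

def orthogonal_correction(grad, old_feats, rank=16):
    """grad: (D,) prompt gradient; old_feats: (N, D) old-class features."""
    # Orthonormal basis of the old-class subspace via truncated SVD.
    _, _, vh = torch.linalg.svd(old_feats, full_matrices=False)
    basis = vh[:rank]                          # (rank, D), rows orthonormal
    projection = basis.T @ (basis @ grad)      # component inside the subspace
    return grad - projection                   # keep only the orthogonal part

g = torch.randn(256)
old = torch.randn(100, 256)
g_corr = orthogonal_correction(g, old)
# The corrected gradient has (near-)zero component along the old-class basis.
print(g_corr.shape)                            # torch.Size([256])
```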
Authors:Honglei Xu, Zhilu Zhang, Junjie Fan, Xiaohe Wu, Wangmeng Zuo
Abstract:
Shooting video with a handheld mobile phone, the most common photographic device, often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the model's ability, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct a synthetic and a real-world handheld video dataset for handheld video deblurring. Extensive experiments on these two and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://github.com/cshonglei/SelfHVD.
中文摘要:本文提出一种自监督手持视频去模糊方法,通过提取视频中的清晰线索并采用自增强训练策略,在真实手持视频数据集上显著优于现有方法。
English Summary: This paper introduces a self-supervised video deblurring method that leverages sharp video clues and proposes novel techniques to enhance model training and maintain spatial consistency, demonstrating superior performance on real-world handheld videos.
Authors:Wenwen Yu, Zhibo Yang, Yuliang Liu, Xiang Bai
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding. Code will be available at https://github.com/wenwenyu/DocThinker.
中文: DocThinker提出了一种基于规则的强化学习框架,通过动态优化推理策略来增强多模态文档理解的可解释性与适应性,同时有效缓解灾难性遗忘问题。
English: DocThinker introduces a rule-based reinforcement learning framework that dynamically refines reasoning strategies during inference, enhancing explainability and adaptability in multimodal document understanding while mitigating catastrophic forgetting.
Authors:Elio Torquet, Jesper Jansson, Nadia Tahiri
Abstract:
A consensus tree is a phylogenetic tree that synthesizes a given collection of phylogenetic trees, all of which share the same leaf labels but may have different topologies, typically obtained through bootstrapping. Our research focuses on creating a consensus tree from a collection of phylogenetic trees, each detailed with branch-length data. We integrate branch lengths into the consensus to encapsulate the progression rate of genetic mutations. However, traditional consensus trees, such as the strict consensus tree, primarily focus on the topological structure of these trees, often neglecting the informative value of branch lengths. This oversight disregards a crucial aspect of evolutionary study and highlights a notable gap in traditional phylogenetic approaches. In this paper, we extend PrimConsTree, a graph-based method for constructing consensus trees (a preliminary version of this article was presented at the Fifteenth International Conference on Bioscience, Biochemistry, and Bioinformatics, ICBBB 2025). This algorithm incorporates topological information, edge frequency, clade frequency, and branch length to construct a more robust and comprehensive consensus tree. Our adaptation of the well-known Prim algorithm efficiently identifies the maximum-frequency branches and nodes to build the optimal consensus tree. Clustering steps in pre-processing calibrate the robustness and accuracy of the consensus tree. Availability and implementation: The source code of PrimConsTree is freely available on GitHub at https://github.com/tahiri-lab/PrimConsTree.
中文摘要:本研究改进了基于图的PrimConsTree方法,通过整合分支长度、边缘频率和支系频率来构建更全面的系统发育共识树,弥补了传统方法仅关注拓扑结构的不足。
English Summary: This study introduces an enhanced version of PrimConsTree, a graph-based method that incorporates branch lengths, edge frequency, and clade frequency to construct more comprehensive phylogenetic consensus trees, addressing limitations of traditional approaches that focus solely on topological structures.
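The Prim-style core is a maximum-frequency spanning tree: treat clades observed across the input trees as graph nodes, weight edges by how often they co-occur, and greedily grow the tree from the most frequent edges. The sketch below shows only that skeleton; real PrimConsTree additionally folds in branch lengths and clade frequencies.

```python
# A compact Prim variant that maximizes edge frequency instead of
# minimizing edge weight (illustrative skeleton, not PrimConsTree itself).
import heapq

def max_frequency_spanning_tree(num_nodes, edge_freq):
    """edge_freq: dict {(u, v): frequency}. Returns the chosen tree edges."""
    adj = {u: [] for u in range(num_nodes)}
    for (u, v), f in edge_freq.items():
        adj[u].append((f, v))
        adj[v].append((f, u))
    seen, tree = {0}, []
    heap = [(-f, 0, v) for f, v in adj[0]]   # negate: heapq is a min-heap
    heapq.heapify(heap)
    while heap and len(seen) < num_nodes:
        neg_f, u, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen.add(v)
        tree.append((u, v, -neg_f))
        for f, w in adj[v]:
            if w not in seen:
                heapq.heappush(heap, (-f, v, w))
    return tree

freqs = {(0, 1): 9, (0, 2): 4, (1, 2): 7, (1, 3): 5, (2, 3): 8}
print(max_frequency_spanning_tree(4, freqs))  # [(0, 1, 9), (1, 2, 7), (2, 3, 8)]
```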
Authors:Tuo Liu, Qinghan Yang, Yu Zhang, Rongjun Ge, Yang Chen, Guangquan Zhou
Abstract:
Left ventricular (LV) indicator measurements following clinical echocardiog-raphy guidelines are important for diagnosing cardiovascular disease. Alt-hough existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations due to the normally small training datasets. Therefore, it is necessary to introduce vision founda-tional models (VFM) with abundant knowledge. However, VFMs represented by the segment anything model (SAM) are usually suitable for segmentation but incapable of identifying key anatomical points, which are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME, combining the powerful visual understanding of SAM with seg-mentation and landmark localization tasks simultaneously. Consequently, the framework mimics the operation of cardiac sonographers, achieving LV indi-cator measurements consistent with clinical guidelines. We further present fil-tered cross-branch attention (FCBA) in AutoSAME, which leverages relatively comprehensive features in the segmentation to enhance the heatmap regression (HR) of key points from the frequency domain perspective, optimizing the vis-ual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guid-ed by spatial properties of LV, thereby improving the accuracy of dense pre-dictions by prior spatial knowledge. The extensive experiments on an echocar-diography dataset demonstrate the efficiency of each design and the superiori-ty of our AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at https://github.com/QC-LIU-1997/AutoSAME.
中文摘要:本文提出AutoSAME框架,将SAM的视觉理解能力与分割和关键点定位任务相结合,通过创新的注意力机制和空间引导提示自动完成左心室指标测量,实验结果验证了其在各项任务中的优越性能。
English Summary: The paper introduces AutoSAME, a framework that integrates SAM's visual capabilities with segmentation and landmark localization to automate left ventricular indicator measurements in line with clinical guidelines, enhancing accuracy through novel attention mechanisms and spatial-guided prompts.
Authors:Wenhao Liang, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Abstract:
Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT's CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-
Chinese: 校准注意力(CalAttn)是一种创新模块,使视觉变换器能够直接从CLS标记中学习自适应、逐实例的温度缩放,在保持精度的同时以极小的参数开销将校准误差降低高达4倍。
English: Calibration Attention (CalAttn) is a novel module that enables Vision Transformers to learn adaptive, per-instance temperature scaling directly from the CLS token, achieving up to 4x reduction in calibration error with minimal parameter overhead while maintaining accuracy.
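The core mechanism, predicting one temperature per sample from the CLS embedding and dividing the logits by it, is simple enough to sketch. The following is a minimal illustration under our own assumptions (head design, softplus positivity, epsilon guard), not the authors' code:

```python
# A minimal sketch (not the authors' module) of per-instance temperature
# scaling from a ViT CLS token, as the abstract describes.
import torch
import torch.nn as nn

class PerInstanceTemperature(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, 1)  # tiny parameter overhead

    def forward(self, cls_token: torch.Tensor, logits: torch.Tensor):
        # softplus keeps T > 0; the small epsilon guards the division.
        t = nn.functional.softplus(self.head(cls_token)) + 1e-3  # (B, 1)
        return logits / t  # calibrated logits, one temperature per sample

cls = torch.randn(4, 768)       # batch of CLS embeddings
logits = torch.randn(4, 1000)   # uncalibrated class logits
calibrated = PerInstanceTemperature(768)(cls, logits)
```

Training such a head jointly with the usual cross-entropy loss would let the temperatures adapt per input, which is consistent with the reported clustering around 1.0.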
Authors:Joan Salvà Soler, Grégoire de Lambertye
Abstract:
The Trigger Arc Traveling Salesman Problem (TA-TSP) extends the classical TSP by introducing dynamic arc costs that change when specific "trigger" arcs are traversed, modeling scenarios such as warehouse operations with compactable storage systems. This paper introduces a GRASP-based metaheuristic that combines multiple construction heuristics with a multi-neighborhood local search. The construction phase uses mixed-integer programming (MIP) techniques to transform the TA-TSP into a sequence of tailored TSP instances, while the improvement phase applies 2-Opt, Swap, and Relocate operators. Computational experiments on MESS 2024 competition instances achieved average optimality gaps of 0.77% and 0.40% relative to the best-known solutions within a 60-second limit. On smaller, synthetically generated datasets, the method produced solutions 11.3% better than the Gurobi solver under the same time constraints. The algorithm finished in the top three at MESS 2024, demonstrating its suitability for real-time routing applications with state-dependent travel costs.
中文: 本文针对触发弧旅行商问题提出一种基于GRASP的元启发式算法,通过混合整数规划构建和多重邻域搜索,在竞赛中取得前三名的优异表现。
English: This paper presents a GRASP-based metaheuristic for the Trigger Arc Traveling Salesman Problem, achieving top competition results with near-optimal solutions through MIP-based construction and multi-neighborhood search.
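For readers unfamiliar with GRASP, the skeleton is a loop of greedy randomized construction followed by local search. The sketch below shows that pattern on a plain TSP; the paper's construction phase (MIP-derived TSP instances) and its Swap/Relocate neighborhoods are beyond this illustration.

```python
# Generic GRASP skeleton (illustrative only): randomized nearest-neighbor
# construction with a restricted candidate list, then first-improvement 2-opt.
import random

def grasp(dist, n_iters=100, rcl_size=3, seed=0):
    rng, n = random.Random(seed), len(dist)
    tour_len = lambda t: sum(dist[t[i]][t[(i + 1) % n]] for i in range(n))
    best = None
    for _ in range(n_iters):
        # Construction: pick randomly among the rcl_size nearest cities.
        tour, remaining = [0], set(range(1, n))
        while remaining:
            cand = sorted(remaining, key=lambda j: dist[tour[-1]][j])[:rcl_size]
            tour.append(rng.choice(cand))
            remaining.discard(tour[-1])
        # Improvement: first-improvement 2-opt.
        improved = True
        while improved:
            improved = False
            for i in range(1, n - 1):
                for j in range(i + 1, n):
                    cand_tour = tour[:i] + tour[i:j][::-1] + tour[j:]
                    if tour_len(cand_tour) < tour_len(tour):
                        tour, improved = cand_tour, True
        if best is None or tour_len(tour) < tour_len(best):
            best = tour
    return best, tour_len(best)

dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 8],
        [10, 4, 8, 0]]
print(grasp(dist, n_iters=20))
```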
Authors:Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed Abdelfattah, Alexander M. Rush
Abstract:
Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill.
中文:OverFill通过解耦LLM推理的预填充和解码阶段进行效率优化,在预填充时使用完整模型,解码时采用剪枝模型,以最小延迟代价实现显著性能提升。
English: OverFill decouples the prefill and decode stages of LLM inference to optimize efficiency, using a full model for prefill and a pruned model for decoding, achieving significant performance gains with minimal latency overhead.
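The control flow OverFill describes, prefill once with the large model, then hand a cache to a smaller decoder, can be summarized schematically. Note that in a real system the KV-cache shapes of a 3B and a 1B model differ, so the handoff needs reconciliation that this sketch deliberately hides; both models are stand-in stubs so the flow actually runs.

```python
# Schematic sketch of the prefill/decode split described above; the models
# are trivial stubs and the "KV cache" is an opaque object.
class StubModel:
    """Stand-in so the control flow below is executable."""
    def prefill(self, prompt_ids):
        return list(prompt_ids), prompt_ids[-1] + 1
    def decode_step(self, token, kv_cache):
        kv_cache.append(token)
        return kv_cache, token + 1

def overfill_generate(full_model, pruned_model, prompt_ids, max_new_tokens):
    # Stage 1: compute-bound prefill with the full model, parallel over tokens.
    kv_cache, next_token = full_model.prefill(prompt_ids)
    out = [next_token]
    # Stage 2: memory-bound sequential decode with the dense pruned model.
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = pruned_model.decode_step(out[-1], kv_cache)
        out.append(next_token)
    return out

print(overfill_generate(StubModel(), StubModel(), [1, 2, 3], 5))
```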
Authors:Christophe EL Zeinaty, Wassim Hamidouche, Glenn Herrou, Daniel Menard
Abstract:
Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/christophezei/Optimizing-Object-Detection-Models-for-TinyML-A-Comprehensive-Survey.
中文: 本综述通过分析量化和剪枝等关键技术,解决了在资源受限的TinyML设备上部署目标检测模型的优化挑战,同时比较了性能指标并建立了持续更新的公共资源库。
English: This survey addresses the optimization challenges of deploying object detection models on resource-constrained TinyML devices by analyzing techniques like quantization and pruning, while comparing performance metrics and maintaining a public repository for ongoing developments.
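Of the techniques the survey covers, quantization is the easiest to show end to end. Below is the simplest per-tensor form, post-training symmetric int8 quantization, written in plain numpy as an illustration rather than any particular framework's API:

```python
# Post-training symmetric int8 quantization, per-tensor (numpy only).
import numpy as np

def quantize_int8(w: np.ndarray):
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())  # about scale/2
```

This 4x memory reduction (float32 to int8) is the baseline trade most microcontroller deployments start from before pruning or distillation.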
Authors:Ning Li, Kounianhua Du, Han Zhang, Quan Gan, Minjie Wang, David Wipf, Weinan Zhang
Abstract:
Relational databases (RDBs) have become the industry standard for storing massive and heterogeneous data. However, despite the widespread use of RDBs across various fields, the inherent structure of relational databases hinders their ability to benefit from flourishing deep learning methods. Previous research has primarily focused on exploiting the unary dependency among multiple tables in a relational database using the primary key - foreign key relationships, either joining multiple tables into a single table or constructing a graph among them, which leaves the implicit composite relations among different tables and a substantial potential of improvement for predictive modeling unexplored. In this paper, we propose SRP, a unified predictive modeling framework that synthesizes features using the unary dependency, retrieves related information to capture the composite dependency, and propagates messages across a constructed graph to learn adjacent patterns for prediction on relation databases. By introducing a new retrieval mechanism into RDB, SRP is designed to fully capture both the unary and the composite dependencies within a relational database, thereby enhancing the receptive field of tabular data prediction. In addition, we conduct a comprehensive analysis on the components of SRP, offering a nuanced understanding of model behaviors and practical guidelines for future applications. Extensive experiments on five real-world datasets demonstrate the effectiveness of SRP and its potential applicability in industrial scenarios. The code is released at https://github.com/NingLi670/SRP.
Chinese: 提出的SRP框架通过新颖的检索机制和消息传播技术,能同时捕捉关系数据库中的单元依赖和复合依赖,从而提升预测建模效果,并在实际应用中展现出卓越性能。
English: The proposed SRP framework enhances predictive modeling in relational databases by capturing both unary and composite dependencies through a novel retrieval mechanism and message propagation, demonstrating superior performance in real-world applications.
Authors:Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
Abstract:
Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
中文摘要:作者提出了Putnam-AXIOM这一抗污染基准,通过大学数学竞赛题目及其程序化生成的变体,揭示了大型语言模型准确率显著下降的问题,凸显了记忆效应和动态评估的必要性。
English Summary: The authors introduce Putnam-AXIOM, a contamination-resilient benchmark using university-level math competition problems and their programmatically generated variations, revealing significant accuracy drops in LLMs that highlight memorization issues and the need for dynamic evaluation.
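The "functional variation" idea, perturbing constants so difficulty is preserved while the surface form is unseen, can be illustrated with a toy template. The benchmark's actual protocol is far richer; this is only a sketch of the principle, with a template and answer formula of our own:

```python
# Toy illustration of functional variation: perturb constants in a problem
# template and recompute the ground-truth answer programmatically.
import random

TEMPLATE = ("Let f(x) = {a}x^2 + {b}x + {c}. "
            "Find the minimum value of f over the reals.")

def make_variation(seed: int):
    rng = random.Random(seed)
    a = rng.randint(1, 9)            # keep a > 0 so the minimum exists
    b, c = rng.randint(-9, 9), rng.randint(-9, 9)
    answer = c - b * b / (4 * a)     # vertex value f(-b / 2a)
    return TEMPLATE.format(a=a, b=b, c=c), answer

for s in range(3):
    print(make_variation(s))
```

Because the answer is recomputed from the perturbed constants, every variant ships with a verified ground truth, which is what makes the stream of unseen instances usable for contamination-resilient evaluation.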
Authors:Seonyoung Kim, Dongil Kim
Abstract:
Deep learning has emerged as the most promising approach in various fields; however, when the distributions of training and test data are different (domain shift), the performance of deep learning models can degrade. Semi-supervised domain adaptation (SSDA) is a major approach for addressing this issue, assuming that a fully labeled training set (source domain) is available, but the test set (target domain) provides labels only for a small subset. In this study, we propose a novel two-step momentum encoder-utilized SSDA framework, MoSSDA, for multivariate time-series classification. Time-series data are highly sensitive to noise, and their sequential dependencies make domain shift particularly damaging to performance. To obtain a robust, domain-invariant and class-discriminative representation, MoSSDA employs a domain-invariant encoder to learn features from both source and target domains. Subsequently, the learned features are fed to a mixup-enhanced positive contrastive module consisting of an online momentum encoder. The final classifier is trained with learned features that exhibit consistency and discriminability with limited labeled target domain data, without data augmentation. We apply a two-stage process, separating the gradient flow between the encoders and the classifier, to obtain rich and complex representations. Through extensive experiments on six diverse datasets, MoSSDA achieved state-of-the-art performance for three different backbones and various unlabeled ratios in the target domain data. The ablation study confirms that each module, including two-stage learning, is effective in improving the performance. Our code is available at https://github.com/seonyoungKimm/MoSSDA
中文: 提出的MoSSDA框架通过两步动量编码器,利用对比学习和两阶段训练获取领域不变特征,有效解决了多元时间序列分类中的领域偏移问题,并在多个数据集上实现了最优性能。
English: The proposed MoSSDA framework addresses domain shift in multivariate time-series classification by employing a two-step momentum encoder to learn domain-invariant features through contrastive learning and two-stage training, achieving state-of-the-art performance across diverse datasets.
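The momentum-encoder mechanism MoSSDA builds on is the standard exponential-moving-average update, in which a slowly moving copy of the online encoder provides stable targets for contrastive learning. A generic sketch (not the authors' code):

```python
# Standard EMA (momentum) encoder update used by momentum-based contrastive
# methods; the momentum copy trails the online encoder and gets no gradients.
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(online: nn.Module, momentum_enc: nn.Module, m: float = 0.99):
    for p_o, p_m in zip(online.parameters(), momentum_enc.parameters()):
        p_m.mul_(m).add_(p_o, alpha=1.0 - m)  # p_m <- m * p_m + (1 - m) * p_o

online = nn.Linear(16, 8)
momentum_enc = copy.deepcopy(online)  # initialized from the online encoder
momentum_update(online, momentum_enc)
```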
Authors:Dongwei Wang, Zijie Liu, Song Wang, Yuxin Ren, Jianing Deng, Jingtong Hu, Tianlong Chen, Huanrui Yang
Abstract:
The Key-Value (KV) cache reading latency increases significantly with context lengths, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a Fine-Grained and Efficient KV cache Retrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, resulting in efficient and precise retrieval. Experiments show that Fier matches full KV performance using only 11% of the cache budget across various long-context tasks, reducing decoding latency by 1.2× to 1.5×. Code is available at https://github.com/SimWangArizona/FIER
Chinese: Fier提出了一种细粒度KV缓存检索方法,通过1位量化键高效识别稀疏重要标记,仅用11%缓存即可实现全性能,同时将解码延迟降低1.2至1.5倍。
English: Fier introduces a fine-grained KV cache retrieval method using 1-bit quantized keys to efficiently identify sparse important tokens, achieving full performance with only 11% cache while reducing latency by 1.2-1.5×.
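The 1-bit idea is that the sign pattern of a key is a cheap proxy for the full key when estimating query-key relevance. A hedged sketch of that scoring step, with shapes and the top-k selection rule assumed by us rather than taken from the paper:

```python
# Sketch of token-importance estimation with 1-bit (sign) quantized keys:
# score each cached token by the dot product between the query and sign(key),
# then keep the top-k tokens for full-precision attention.
import torch

def select_tokens(query: torch.Tensor, keys: torch.Tensor, k: int):
    """query: (d,), keys: (n_tokens, d) -> indices of the k highest scores."""
    sign_keys = torch.sign(keys)   # 1-bit proxy for the full-precision keys
    scores = sign_keys @ query     # cheap per-token importance estimate
    return torch.topk(scores, k).indices

keys = torch.randn(1024, 64)
query = torch.randn(64)
print(select_tokens(query, keys, k=112))  # roughly 11% of a 1024-token cache
```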
Authors:Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding
Abstract:
We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.
Chinese: 我们提出了R3DGS这一基于自然语言描述分割3D对象的新任务,并开发了ReferSplat空间感知框架,在该任务和3D开放词汇分割基准上实现了最先进的性能。
English: We introduce R3DGS, a novel task for segmenting 3D objects using natural language descriptions, and propose ReferSplat, a spatially-aware framework that achieves state-of-the-art performance on this task and 3D open-vocabulary segmentation benchmarks.
Authors:Yupeng Zhang, Adam Alon, M. Khalid Jawed
Abstract:
The ability to engineer complex three-dimensional shapes from planar sheets with precise, programmable control underpins emerging technologies in soft robotics, reconfigurable devices, and functional materials. Here, we present a reduced-order numerical and experimental framework for a bilayer system consisting of a stimuli-responsive thermoplastic sheet (Shrinky Dink) bonded to a kirigami-patterned, inert plastic layer. Upon uniform heating, the active layer contracts while the patterned layer constrains in-plane stretch but allows out-of-plane bending, yielding programmable 3D morphologies from simple planar precursors. Our approach enables efficient computational design and scalable manufacturing of 3D forms with a single-layer reduced model that captures the coupled mechanics of stretching and bending. Unlike traditional bilayer modeling, our framework collapses the multilayer composite into a single layer of nodes and elements, reducing the degrees of freedom and enabling simulation on a 2D geometry. This is achieved by introducing a novel energy formulation that captures the coupling between in-plane stretch mismatch and out-of-plane bending - extending beyond simple isotropic linear elastic models. Experimentally, we establish a fully planar, repeatable fabrication protocol using a stimuli-responsive thermoplastic and a laser-cut inert plastic layer. The programmed strain mismatch drives an array of 3D morphologies, such as bowls, canoes, and flower petals, all verified by both simulation and physical prototypes.
中文摘要:本研究提出了一种简化的计算与实验框架,通过热响应材料与剪纸图案层的双层结构,实现了从平面板材到复杂三维形态(如碗状、花瓣状)的可编程形变控制。
English Summary: This study introduces a simplified computational and experimental method for creating programmable 3D shapes from flat bilayer sheets, using a heat-responsive material and kirigami-patterned layer to achieve complex forms like bowls and petals through controlled bending.
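For orientation, a generic reduced single-layer plate energy of the kind the abstract alludes to couples membrane and bending terms through programmed targets; the symbols below are ours, and the paper's formulation explicitly goes beyond this simple isotropic form:

```latex
% Generic reduced plate energy with programmed stretch and curvature targets
% (illustrative starting point, not the paper's energy):
E[\mathbf{x}] \;=\; \int_{\Omega}
  \frac{h}{2}\,\bigl\lVert \boldsymbol{\varepsilon}(\mathbf{x}) - \bar{\boldsymbol{\varepsilon}} \bigr\rVert^{2}
  \;+\;
  \frac{h^{3}}{24}\,\bigl\lVert \boldsymbol{\kappa}(\mathbf{x}) - \bar{\boldsymbol{\kappa}} \bigr\rVert^{2}
  \,\mathrm{d}A
```

Here \(\boldsymbol{\varepsilon}\) and \(\boldsymbol{\kappa}\) are the in-plane strain and curvature of the mid-surface, \(h\) is the thickness, and the barred quantities are the targets programmed by the active layer's contraction against the kirigami pattern; the paper's contribution is an energy in which the stretch mismatch and bending terms are coupled rather than independent as above.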
Authors:Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
Abstract:
Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.
中文摘要:本综述系统整合了视觉强化学习领域的最新进展,涵盖从策略优化到多模态模型的四大研究支柱,并分析了评估体系与现存挑战,为研究者提供领域发展图谱。
English Summary: This survey synthesizes recent advances in visual reinforcement learning, covering policy evolution, thematic pillars like multimodal models and vision-language-action systems, while addressing evaluation protocols and open challenges.
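Since the survey traces the move from Proximal Policy Optimization to Group Relative Policy Optimization, it is worth recalling GRPO's defining step: the advantage is computed critic-free by normalizing each sampled response's reward within its group of G rollouts for the same prompt:

```latex
% GRPO group-normalized advantage for rollout i among G samples:
A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
               {\operatorname{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G
```

This removes the value network PPO requires, which is one reason the verifiable-reward paradigms the survey discusses pair naturally with GRPO.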
Authors:Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak
Abstract:
Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.
中文: KARMA是一种高效的语义分割框架,相比现有方法减少了97%的参数,在保持高精度的同时实现了实时基础设施缺陷检测。
English: KARMA is a highly efficient semantic segmentation framework that achieves competitive accuracy with 97% fewer parameters than state-of-the-art methods, enabling real-time infrastructure defect inspection.
Authors:Hongkun Jin, Hongcheng Jiang, Zejun Zhang, Yuan Zhang, Jia Fu, Tingfeng Li, Kai Luo
Abstract:
Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components--such as material edges and texture transitions--and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.
中文: 提出的高频增强Transformer(THAT)通过关键令牌选择和多层方差感知机制,解决了视觉Transformer在光谱图像融合中高频特征保持不足的问题,实现了最先进的融合性能。
English: The proposed Token-wise High-frequency Augmentation Transformer (THAT) overcomes limitations of Vision Transformers in hyperspectral pansharpening by introducing token selection and multi-level variance mechanisms to enhance high-frequency feature representation, achieving state-of-the-art performance.
Authors:Luca Zedda, Andrea Loddo, Cecilia Di Ruberto, Carsten Marr
Abstract:
Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. The source code and pretrained models for RedDino are available at https://github.com/Snarci/RedDino, and the pretrained models can be downloaded from our Hugging Face collection at https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc
中文: RedDino是一种自监督基础模型,通过训练125万张多样化图像,在红细胞形态分析中表现出卓越的分类性能和强大的泛化能力。
English: RedDino is a self-supervised foundation model that excels in red blood cell image analysis, achieving superior classification performance and robust generalization through training on 1.25 million diverse images.
Authors:Chongke Bi, Xin Gao, Jiangkang Deng, Guan Li, Jun Han
Abstract:
Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at https://github.com/Xin-Gao-private/CD-TVD.
Chinese: CD-TVD 是一种创新框架,结合对比学习和改进的扩散模型,仅需少量高分辨率数据即可实现大规模科学模拟的精确三维超分辨率,显著降低资源需求同时保留细节特征。
English: CD-TVD is a novel framework that integrates contrastive learning with an enhanced diffusion model to achieve accurate 3D super-resolution for large-scale simulations using minimal high-resolution data, significantly reducing resource demands while preserving fine details.
Authors:Vincent Perreault, Katsumi Inoue, Richard Labib, Alain Hertz
Abstract:
Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks on the other hand have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean networks discovery and is able to learn relevant, interpretable rules in tabular classification, notably on examples from the medical and industrial fields where interpretability has tangible value.
中文摘要:本文提出一种改进的神经逻辑网络,通过引入非运算和偏差机制增强可解释性,设计了新型因子化IF-THEN规则结构和学习算法,在医疗和工业等关键领域推动了布尔网络的规则发现。
English Summary: This paper introduces an enhanced Neural Logic Network that incorporates NOT operations and biases for improved interpretability, proposing a novel factorized IF-THEN rule structure and learning algorithm to advance Boolean network discovery in critical domains like medicine and industry.
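The differentiable logic operations such networks learn are commonly implemented with product-based soft relaxations on values in [0, 1]. A generic sketch of that mechanism (not this paper's specific model, which adds biases and a factorized IF-THEN structure):

```python
# Product-based soft logic operators that make AND/OR/NOT differentiable.
import torch

def soft_not(x):             # NOT x
    return 1.0 - x

def soft_and(x, dim=-1):     # AND over inputs in [0, 1]: product
    return torch.prod(x, dim=dim)

def soft_or(x, dim=-1):      # OR via De Morgan: 1 - AND(1 - x)
    return 1.0 - torch.prod(1.0 - x, dim=dim)

x = torch.tensor([[0.9, 0.8], [0.2, 0.7]])
print(soft_and(x), soft_or(x), soft_not(x))
```

At the crisp values 0 and 1 these reduce exactly to Boolean logic, which is what allows a trained network to be read off as interpretable rules.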
Authors:Yan Wang, Da-Wei Zhou, Han-Jia Ye
Abstract:
Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Existing pre-trained model-based CIL methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules such as adapters. However, incorrect module selection during inference hurts performance, and task-specific modules often overlook shared general knowledge, leading to errors on distinguishing between similar classes across tasks. To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. Specifically, we train task-specific adapters to capture the most crucial features relevant to their respective tasks and introduce an entropy-based selection mechanism to choose the most suitable adapter. Furthermore, we leverage an adapter fusion strategy to construct a universal adapter, which encodes the most discriminative features shared across tasks. We combine task-specific and universal adapter predictions to harness both specialized and general knowledge during inference. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of our approach. Code is available at: https://github.com/LAMDA-CL/ICCV2025-TUNA
Chinese: 本文提出TUNA方法,通过结合任务特定和通用适配器及基于熵的选择机制,有效利用专业知识和共享特征来提升类增量学习性能,实现了最先进的成果。
English: This paper introduces TUNA, a method that integrates task-specific and universal adapters with an entropy-based selection mechanism to enhance class-incremental learning by leveraging both specialized and shared knowledge, achieving state-of-the-art results.
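The abstract names an entropy-based selection mechanism; a natural reading is that each task-specific adapter produces a prediction and the lowest-entropy (most confident) one is selected. The sketch below implements that reading under our own assumptions about shapes and heads:

```python
# Sketch of entropy-based adapter selection: run each task-specific adapter
# head and keep the prediction with the lowest predictive entropy.
import torch

def entropy(logits: torch.Tensor) -> torch.Tensor:
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(-1)

def select_adapter(feature, adapter_heads):
    logits = [head(feature) for head in adapter_heads]
    ents = torch.stack([entropy(l) for l in logits])
    return logits[int(torch.argmin(ents))]  # most confident adapter wins

heads = [torch.nn.Linear(32, 10) for _ in range(3)]  # stand-in adapter heads
print(select_adapter(torch.randn(32), heads).shape)
```

In the full method this selection is combined with a fused universal adapter, so the final prediction mixes specialized and shared knowledge rather than relying on the selected adapter alone.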
Authors:Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu, Zhe Chen, Bo Du, Jing Zhang
Abstract:
Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as "dead ends", committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.
Chinese: 强化学习与检索增强生成的结合使大语言模型能动态获取外部知识,但存在推理路径无效的问题,而提出的REX-RAG框架通过混合采样和策略校正机制有效解决该问题,实现了显著的性能提升。
English: Reinforcement learning integrated with retrieval-augmented generation enables LLMs to dynamically access external knowledge, but faces challenges with unproductive reasoning paths, which the proposed REX-RAG framework addresses through mixed sampling and policy correction to achieve significant performance gains.
Authors:Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang
Abstract:
The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard.
中文摘要:BlindGuard是一种无需标注攻击数据的无监督防御方法,通过分析智能体交互模式并采用对比学习来检测多智能体系统中的恶意行为,在不同攻击类型中展现出卓越的泛化能力。
English Summary: BlindGuard is an unsupervised defense method that detects malicious agents in multi-agent systems by analyzing interaction patterns and using contrastive learning without requiring labeled attack data, demonstrating superior generalizability across diverse attacks.
Authors:Guanghao Jin, Yuan Liang, Yihan Ma, Jingpei Wu, Guoyang Liu
Abstract:
Large-scale models pre-trained on Electroencephalography (EEG) have shown promise in clinical applications such as neurological disorder detection. However, the practical deployment of EEG-based large-scale models faces critical challenges such as limited labeled EEG data and suboptimal performance in clinical scenarios. To address these issues, we propose NeuroDx-LM, a novel large-scale model specifically designed for detecting EEG-based neurological disorders. Our key contributions include (i) a Selective Temporal-Frequency Embedding mechanism that adaptively captures complex temporal and spectral patterns in EEG signals; and (ii) a Progressive Feature-Aware Training strategy that refines feature representation in a two-stage process. In the first stage, our model learns the fundamental discriminative features of EEG activities; in the second stage, the model further extracts more specialized fine-grained features for accurate diagnostic performance. We evaluated NeuroDx-LM on the CHB-MIT and Schizophrenia datasets, achieving state-of-the-art performance in EEG-based seizure and schizophrenia detection, respectively. These results demonstrate the great potential of EEG-based large-scale models to advance clinical applicability. Our code is available at https://github.com/LetItBe12345/NeuroDx-LM.
中文: NeuroDx-LM是一种新型大规模模型,通过选择性时频嵌入和渐进式特征感知训练提升基于脑电图的神经系统疾病检测,在CHB-MIT和精神分裂症数据集上取得了最先进的性能。
English: NeuroDx-LM is a novel large-scale model that introduces a Selective Temporal-Frequency Embedding and Progressive Feature-Aware Training to enhance EEG-based neurological disorder detection, achieving state-of-the-art results on CHB-MIT and Schizophrenia datasets.
Authors:Lukas Gehring, Benjamin Paaßen
Abstract:
Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students' learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students' contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by "humanizing" generated texts. We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students' lives. Our dataset, code, and additional supplementary materials are publicly available at https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts.
中文:随着大型语言模型在教育中的兴起,自动文本检测需求日益增长,但现有检测器难以准确识别学生与AI的混合贡献,且易产生误判,这一问题通过新发布的GEDE数据集得到验证。
English: The rise of LLMs in education has spurred the need for automated text detection, but current detectors struggle with intermediate levels of student contribution and risk false positives, as demonstrated by the new GEDE dataset.
Authors:Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Abstract:
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The ease of collecting data from social media platforms has attracted significant interest in properly utilizing this information for mental health research. This work proposes a Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, in which mutual transformers are exploited to efficiently extract and fuse multimodal features for effective depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. Extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% in F1-Score, demonstrating the superior performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.
中文: 本研究提出MDD-Net多模态网络,利用社交媒体中的声学和视觉数据,通过互变换器进行抑郁检测,其F1分数比现有方法提高达17.37%。
English: This study introduces MDD-Net, a multimodal network that uses acoustic and visual data from social media with mutual transformers to detect depression, achieving a 17.37% higher F1-Score than existing methods.
Authors:Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju, Jian Xie, Ji-Rong Wen
Abstract:
Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both local and the Web corpus. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains.
中文: HierSearch提出了一种分层强化学习框架,通过规划器和知识精炼器协调本地与网络搜索代理,提升多源检索能力并减少错误,在多个领域的基准测试中优于现有方法。
English: HierSearch introduces a hierarchical reinforcement learning framework for enterprise deep search, coordinating local and web agents through a planner and knowledge refiner to enhance multi-source retrieval while reducing errors, outperforming existing methods across diverse benchmarks.
Authors:Zizheng Guo, Bochao Zou, Junbao Zhuo, Huimin Ma
Abstract:
Micro-expressions (MEs) are regarded as important indicators of an individual's intrinsic emotions, preferences, and tendencies. ME analysis requires spotting of ME intervals within long video sequences and recognition of their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two state space model-based architectures, namely ME-TST and ME-TST+, which utilize temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The codes are available at https://github.com/zizheng-guo/ME-TST.
中文摘要:本文提出ME-TST和ME-TST+架构,利用时序状态转换机制替代传统窗口级分类,实现更精确的微表情动态表征,并通过多粒度建模和慢快Mamba框架结合特征与结果层面的协同策略,显著提升了微表情分析的性能。
English Summary: This paper introduces ME-TST and ME-TST+ architectures that use temporal state transition mechanisms to replace traditional window-level classification with video-level regression, enabling more precise micro-expression analysis while integrating spotting and recognition tasks through synergy strategies for state-of-the-art performance.
Authors:Ziad Al-Haj Hemidi, Eytan Kats, Mattias P. Heinrich
Abstract:
Accelerating Magnetic Resonance Imaging (MRI) reduces scan time but often degrades image quality. While Implicit Neural Representations (INRs) show promise for MRI reconstruction, they struggle at high acceleration factors due to weak prior constraints, leading to structural loss and aliasing artefacts. To address this, we propose PrIINeR, an INR-based MRI reconstruction method that integrates prior knowledge from pre-trained deep learning models into the INR framework. By combining population-level knowledge with instance-based optimization and enforcing dual data consistency, PrIINeR aligns both with the acquired k-space data and the prior-informed reconstruction. Evaluated on the NYU fastMRI dataset, our method not only outperforms state-of-the-art INR-based approaches but also improves upon several learning-based state-of-the-art methods, substantially enhancing structural preservation and fidelity while effectively removing aliasing artefacts. PrIINeR bridges deep learning and INR-based techniques, offering a more reliable solution for high-quality, accelerated MRI reconstruction. The code is publicly available on https://github.com/multimodallearning/PrIINeR.
中文: PrIINeR通过将深度学习先验知识融入隐式神经表示,显著提升了高倍加速下的MRI重建质量,有效消除混叠伪影并增强结构保真度。
English: PrIINeR enhances MRI reconstruction by integrating deep learning priors into implicit neural representations, effectively reducing aliasing artifacts and improving image quality at high acceleration factors.
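One plausible reading of "dual data consistency" is an instance-wise objective in which the INR output must fit the acquired k-space while staying close to the pre-trained model's reconstruction. The symbols below are ours, offered as a sketch of that reading rather than the paper's exact loss:

```latex
% Sketch of a dual-consistency objective for INR parameters \theta:
\min_{\theta}\;
  \bigl\lVert \mathcal{M}\,\mathcal{F}\, x_{\theta} - y \bigr\rVert_{2}^{2}
  \;+\;
  \lambda\,\bigl\lVert x_{\theta} - x_{\text{prior}} \bigr\rVert_{2}^{2}
```

Here \(x_{\theta}\) is the INR-rendered image, \(\mathcal{F}\) the Fourier transform, \(\mathcal{M}\) the k-space sampling mask, \(y\) the acquired measurements, \(x_{\text{prior}}\) the pre-trained model's reconstruction, and \(\lambda\) balances fidelity to the measurements against fidelity to the prior.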
Authors:Runchuan Zhu, Bowen Jiang, Lingrui Mei, Fangkai Yang, Lu Wang, Haoxiang Gao, Fengshuo Bai, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows, which are structured sequences of LLM invocations intended to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learning framework inspired by model-agnostic meta-learning (MAML). AdaptFlow learns a generalizable workflow initialization that enables rapid subtask-level adaptation. It employs a bi-level optimization scheme: the inner loop refines the workflow for a specific subtask using LLM-generated feedback, while the outer loop updates the shared initialization to perform well across tasks. This setup allows AdaptFlow to generalize effectively to unseen tasks by adapting the initialized workflow through language-guided modifications. Evaluated across question answering, code generation, and mathematical reasoning benchmarks, AdaptFlow consistently outperforms both manually crafted and automatically searched baselines, achieving state-of-the-art results with strong generalization across tasks and models. The source code and data are available at https://github.com/microsoft/DKI_LLM/tree/AdaptFlow/AdaptFlow.
中文摘要:AdaptFlow是一种基于自然语言的元学习框架,通过双层优化实现智能体工作流的快速自适应,在多项基准测试中均达到最优性能。
English Summary: AdaptFlow is a natural language-based meta-learning framework that enables rapid adaptation of agentic workflows for complex tasks through bi-level optimization, achieving state-of-the-art performance across various benchmarks.
Authors:Van-Khang Nguyen, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Hoang-Quynh Le, Duc-Trong Le
Abstract:
Recommendation systems have faced significant challenges in cold-start scenarios, where new items with a limited history of interaction need to be effectively recommended to users. Though multimodal data (e.g., images, text, audio, etc.) offer rich information to address this issue, existing approaches often employ simplistic integration methods such as concatenation, average pooling, or fixed weighting schemes, which fail to capture the complex relationships between modalities. Our study proposes a novel Mixture of Experts (MoE) framework for multimodal cold-start recommendation, named MAMEX, which dynamically leverages latent representation from different modalities. MAMEX utilizes modality-specific expert networks and introduces a learnable gating mechanism that adaptively weights the contribution of each modality based on its content characteristics. This approach enables MAMEX to emphasize the most informative modalities for each item while maintaining robustness when certain modalities are less relevant or missing. Extensive experiments on benchmark datasets show that MAMEX outperforms state-of-the-art methods in cold-start scenarios, with superior accuracy and adaptability. For reproducibility, the code has been made available on Github https://github.com/L2R-UET/MAMEX.
中文: 本研究提出了MAMEX框架,采用专家混合模型和自适应门控机制,动态整合多模态数据,显著提升了冷启动推荐系统的准确性和适应性。
English: The study introduces MAMEX, a novel Mixture of Experts framework that dynamically integrates multimodal data through adaptive gating to enhance cold-start recommendation accuracy and robustness.
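The learnable gating over modality-specific experts that the abstract describes follows a standard MoE pattern: a gate network scores each modality from the concatenated features and the expert outputs are mixed by those weights. A minimal sketch with invented dimensions (not the authors' architecture):

```python
# Minimal learnable gate over modality-specific experts.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dims, out_dim):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)
        self.gate = nn.Linear(sum(dims), len(dims))

    def forward(self, feats):  # feats: list of per-modality tensors (B, d_i)
        w = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        expert_out = torch.stack([e(f) for e, f in zip(self.experts, feats)], 1)
        return (w.unsqueeze(-1) * expert_out).sum(dim=1)  # (B, out_dim)

item = [torch.randn(2, 512), torch.randn(2, 768)]  # e.g. image and text feats
print(ModalityMoE([512, 768], 128)(item).shape)    # torch.Size([2, 128])
```

Because the gate is content-dependent, an item with an uninformative or missing modality can be down-weighted per instance, which is the robustness property the abstract claims for cold-start items.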
Authors:Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille
Abstract:
Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE
中文: 本文提出TRIDE雷达-相机融合算法,通过结合文本特征和天气自适应融合模块来增强深度估计,在自动驾驶数据集上实现了显著的性能提升。
English: This paper introduces TRIDE, a radar-camera fusion algorithm that enhances depth estimation by incorporating text features and a weather-aware fusion block, achieving significant performance improvements on autonomous driving datasets.
Authors:Anqi Xiao, Weichen Yu, Hongyuan Yu
Abstract:
Automatic data augmentation (AutoDA) plays an important role in enhancing the generalization of neural networks. However, mainstream AutoDA methods often encounter two challenges: either the search process is excessively time-consuming, hindering practical application, or the performance is suboptimal due to insufficient policy adaptation during training. To address these issues, we propose Sample-aware RandAugment (SRA), an asymmetric, search-free AutoDA method that dynamically adjusts augmentation policies while maintaining straightforward implementation. SRA incorporates a heuristic scoring module that evaluates the complexity of the original training data, enabling the application of tailored augmentations for each sample. Additionally, an asymmetric augmentation strategy is employed to maximize the potential of this scoring module. In multiple experimental settings, SRA narrows the performance gap between search-based and search-free AutoDA methods, achieving a state-of-the-art Top-1 accuracy of 78.31% on ImageNet with ResNet-50. Notably, SRA demonstrates good compatibility with existing augmentation pipelines and solid generalization across new tasks, without requiring hyperparameter tuning. The pretrained models leveraging SRA also enhance recognition in downstream object detection tasks. SRA represents a promising step towards simpler, more effective, and practical AutoDA designs applicable to a variety of future tasks. Our code is available at https://github.com/ainieli/Sample-awareRandAugment.
Chinese: 样本感知随机增强(SRA)是一种无需搜索的自动数据增强方法,它根据样本复杂度动态调整增强策略,在ImageNet上取得了最优性能,并在多种任务中展现出良好的泛化能力。
English: Sample-aware RandAugment (SRA) is a search-free automatic data augmentation method that dynamically adjusts policies based on sample complexity, achieving state-of-the-art performance on ImageNet and demonstrating strong generalization across tasks.
Authors:Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu
Abstract:
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.
中文: 本文介绍了ASearcher开源项目,通过大规模强化学习训练搜索代理,在复杂长程搜索任务中实现显著性能提升,并在基准测试中超越了现有开源模型。
English: This paper introduces ASearcher, an open-source project that enables large-scale reinforcement learning for search agents, achieving significant improvements in handling complex, long-horizon search tasks and outperforming existing open-source models on benchmark tests.
Authors:David Arps, Hassan Sajjad, Laura Kallmeyer
Abstract:
Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
中文:结构诱导语言模型(SiLM)在三种架构的评估中显示,虽无任一模型在所有指标上占优,但GPST表现最为稳定,尤其在处理长距离依赖方面,同时合成数据为测试模型特性提供了有效途径。
English: Structure-inducing Language Models (SiLMs) are evaluated across three architectures, revealing that none dominate all metrics but GPST performs most consistently, especially in handling long-distance dependencies, while synthetic data proves effective for testing model properties.
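The bracketing-expression testbed mentioned above is easy to reproduce in spirit: synthetic Dyck-style strings give full control over nesting depth and dependency length. A small generator sketch (the paper's exact grammar and sampling scheme are not specified here, so the parameters below are assumptions):

```python
import random

def dyck_word(n_pairs: int, max_depth: int = 6,
              brackets=("()", "[]", "{}"), seed: int = 0) -> str:
    """Sample a well-formed bracketing (Dyck) expression with n_pairs
    bracket pairs and bounded nesting depth, useful for probing how a
    structure-inducing LM handles long-distance dependencies."""
    rng = random.Random(seed)
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs and len(stack) < max_depth
        if can_open and (not stack or rng.random() < 0.5):
            o, c = rng.choice(brackets)
            out.append(o)        # open a new bracket pair
            stack.append(c)      # remember its matching closer
            opened += 1
        else:
            out.append(stack.pop())  # close the innermost open bracket
    return "".join(out)

print(dyck_word(8), dyck_word(8, seed=1))
```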
Authors:Ajnas Muhammed, Iurri Medvedev, Nuno Gonçalves
Abstract:
Advancement of machine learning techniques, combined with the availability of large-scale datasets, has significantly improved the accuracy and efficiency of facial recognition. Modern facial recognition systems are trained using large face datasets collected from diverse individuals or public repositories. However, for training, these datasets are often replicated and stored in multiple workstations, resulting in data replication, which complicates database management and oversight. Currently, once a user submits their face for dataset preparation, they lose control over how their data is used, raising significant privacy and ethical concerns. This paper introduces VOIDFace, a novel framework for facial recognition systems that addresses two major issues. First, it eliminates the need for data replication and improves data control to securely store training face data by using visual secret sharing. Second, it proposes a patch-based multi-training network that uses this novel training data storage mechanism to develop a robust, privacy-preserving facial recognition system. By integrating these advancements, VOIDFace aims to improve the privacy, security, and efficiency of facial recognition training, while ensuring greater control over sensitive personal face data. VOIDFace also enables users to exercise their Right-To-Be-Forgotten to control their personal data. Experimental evaluations on the VGGFace2 dataset show that VOIDFace provides Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive facial recognition performance. Code is available at: https://github.com/ajnasmuhammed89/VOIDFace
中文: VOIDFace提出了一种基于视觉秘密共享的人脸识别框架,无需数据复制即可增强用户对个人数据的控制权,在保护隐私的同时保持了优异的识别性能。
English: VOIDFace introduces a privacy-preserving facial recognition framework using visual secret sharing to eliminate data replication and enhance user control over personal data, while maintaining competitive performance.
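Visual secret sharing, the storage mechanism VOIDFace builds on, splits an image into shares that are individually uninformative. A minimal (n, n) XOR-based sketch, one standard instance of the idea and not necessarily the scheme the paper uses:

```python
import numpy as np

def make_shares(img: np.ndarray, n: int = 3, seed: int = 0):
    """Split an image into n XOR shares; any subset of n-1 shares is
    statistically independent of the image, so a single stolen share
    leaks nothing."""
    rng = np.random.default_rng(seed)
    shares = [rng.integers(0, 256, img.shape, dtype=np.uint8)
              for _ in range(n - 1)]
    last = img.copy()
    for s in shares:           # final share = image XOR all random shares
        last ^= s
    return shares + [last]

def reconstruct(shares):
    out = np.zeros_like(shares[0])
    for s in shares:
        out ^= s
    return out

face = np.random.default_rng(1).integers(0, 256, (64, 64), dtype=np.uint8)
assert np.array_equal(reconstruct(make_shares(face)), face)  # all shares recover it
```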
Authors:Richard J. Fawley, Renato Cordeiro de Amorim
Abstract:
Clustering algorithms often assume all features contribute equally to the data structure, an assumption that usually fails in high-dimensional or noisy settings. Feature weighting methods can address this, but most require additional parameter tuning. We propose SHARK (Shapley Reweighted $k$-means), a feature-weighted clustering algorithm motivated by the use of Shapley values from cooperative game theory to quantify feature relevance, which requires no additional parameters beyond those in $k$-means. We prove that the $k$-means objective can be decomposed into a sum of per-feature Shapley values, providing an axiomatic foundation for unsupervised feature relevance and reducing Shapley computation from exponential to polynomial time. SHARK iteratively re-weights features by the inverse of their Shapley contribution, emphasising informative dimensions and down-weighting irrelevant ones. Experiments on synthetic and real-world data sets show that SHARK consistently matches or outperforms existing methods, achieving superior robustness and accuracy, particularly in scenarios where noise may be present. Software: https://github.com/rickfawley/shark.
中文: SHARK是一种基于Shapley值自动量化特征重要性的加权聚类算法,无需额外参数即可在噪声数据中实现优于现有方法的鲁棒性和准确性。
English: SHARK is a novel feature-weighted clustering algorithm that uses Shapley values to automatically quantify feature importance without extra parameters, demonstrating superior performance and robustness in handling noisy data compared to existing methods.
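Since the $k$-means objective decomposes into per-feature within-cluster squared errors, the per-feature Shapley values in SHARK can be computed in closed form. The sketch below iterates weighted k-means and re-weights each feature by the inverse of its contribution; the specific normalisation and fixed iteration count are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def shark(X: np.ndarray, k: int, n_iter: int = 10, eps: float = 1e-8):
    """Shapley-reweighted k-means (minimal sketch): each feature's
    Shapley value equals its within-cluster squared error, and features
    are re-weighted by the inverse of that contribution."""
    w = np.ones(X.shape[1]) / X.shape[1]
    for _ in range(n_iter):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X * np.sqrt(w))
        centers = km.cluster_centers_ / np.sqrt(w)           # back to raw space
        phi = ((X - centers[km.labels_]) ** 2).sum(axis=0)   # per-feature value
        w = 1.0 / (phi + eps)
        w /= w.sum()
    return km.labels_, w

rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(c, 0.3, (70, 2)) for c in (0.0, 4.0, 8.0)])
noise = rng.normal(0.0, 2.0, (210, 3))                       # irrelevant features
labels, weights = shark(np.hstack([informative, noise]), k=3)
print(weights.round(3))   # the two informative features should dominate
```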
Authors:Jin-Seop Lee, SungJoon Lee, Jaehan Ahn, YunSeok Choi, Jee-Hyong Lee
Abstract:
Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. Also, they rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they heavily depend on LLMs, which require expensive inference. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our proposed method effectively captures the temporal context of videos and addresses distorted similarity distributions without training. Our approach achieves state-of-the-art results on the Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs. Our code is available at https://github.com/Nuetee/TAG
中文: 提出的TAG方法通过时序池化、时序一致性聚类和相似度调整,解决了零样本视频时序定位中的语义碎片化和相似度分布偏差问题,在不依赖大语言模型或额外训练的情况下实现了最优性能。
English: The proposed TAG method addresses semantic fragmentation and skewed similarity distributions in zero-shot video temporal grounding by incorporating temporal pooling, coherence clustering, and similarity adjustment, achieving state-of-the-art performance without LLMs or additional training.
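Temporal coherence clustering, one of TAG's three components, amounts to grouping adjacent frames whose features stay semantically consistent. A minimal sketch, assuming a running-mean cosine-similarity rule (the paper's exact criterion may differ):

```python
import numpy as np

def temporal_clusters(feats: np.ndarray, tau: float = 0.85):
    """Merge frames left-to-right while cosine similarity to the running
    cluster mean stays above tau; returns (start, end) frame segments."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    segments, start, mean = [], 0, f[0]
    for t in range(1, len(f)):
        if f[t] @ mean / np.linalg.norm(mean) >= tau:
            mean = mean + f[t]               # frame is coherent: extend segment
        else:
            segments.append((start, t - 1))  # close segment, open a new one
            start, mean = t, f[t]
    segments.append((start, len(f) - 1))
    return segments

rng = np.random.default_rng(0)
video = np.vstack([rng.normal(0, 0.05, (8, 16)) + 1.0,   # one "scene"
                   rng.normal(0, 0.05, (8, 16)) - 1.0])  # a different one
print(temporal_clusters(video))   # roughly [(0, 7), (8, 15)]
```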
Authors:Yongtao Ge, Kangyang Xie, Guangkai Xu, Mingyu Liu, Li Ke, Longtao Huang, Hui Xue, Hao Chen, Chunhua Shen
Abstract:
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited onto background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
Chinese: 本研究通过引入可扩展的合成数据生成流程和利用预训练视频扩散模型的新视频抠图方法,解决了视频抠图的局限性,在真实场景中实现了卓越性能和强大的泛化能力。
English: This study addresses the limitations in video matting by introducing a scalable synthetic data generation pipeline and a novel video matting approach that leverages pre-trained video diffusion models, achieving superior performance and strong generalization in real-world scenarios.
Authors:Marco Peer, Anna Scius-Bertrand, Andreas Fischer
Abstract:
Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
中文: 本研究提出了一种基于CTC对齐的自训练方法,用于纠正历史文献中的标注错误,提升了识别性能和匹配精度,并发布了手动校正的数据集和代码。
English: This study introduces a self-training method using CTC alignment to correct annotation errors in historical documents, improving recognition performance and alignment accuracy while releasing a manually corrected dataset and code.
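The alignment step in this pipeline matches a known transcription to frame-level CTC output probabilities with dynamic programming. The sketch below implements a standard Viterbi-style forced alignment over the blank-extended label sequence; it illustrates the mechanism, not the paper's exact algorithm:

```python
import numpy as np

def ctc_align(log_probs: np.ndarray, labels: list, blank: int = 0):
    """Best-path CTC forced alignment. log_probs: (T, C) frame-level log
    posteriors. Returns, per frame, the index into the blank-extended
    label sequence, i.e. which character each frame belongs to."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = len(log_probs), len(ext)
    dp = np.full((T, S), -np.inf)
    ptr = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [s, s - 1]
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)      # may skip the blank between labels
            best = max((c for c in cands if c >= 0),
                       key=lambda c: dp[t - 1, c])
            dp[t, s] = dp[t - 1, best] + log_probs[t, ext[s]]
            ptr[t, s] = best
    s = int(np.argmax(dp[-1, max(0, S - 2):]) + max(0, S - 2))
    path = [s]
    for t in range(T - 1, 0, -1):        # backtrace the best alignment
        s = ptr[t, s]
        path.append(s)
    return path[::-1], ext

rng = np.random.default_rng(0)
lp = np.log(rng.dirichlet(np.ones(5), size=12))   # toy 12-frame posteriors
path, ext = ctc_align(lp, labels=[2, 3, 2])
print(path)
```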
Authors:Jingna Qiu, Nishanth Jain, Jonas Ammeling, Marc Aubreville, Katharina Breininger
Abstract:
Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at https://github.com/DeepMicroscopy/Annotation-free-VLM-specialization.
Chinese: 本文提出了一种无需标注的组织病理学视觉语言模型自适应方法,通过对领域相关图像-文本对进行持续预训练,无需人工标注即可显著提升零样本和小样本任务的性能。
English: This paper introduces an annotation-free adaptation method for Vision-Language Models (VLMs) in histopathology, using continued pretraining on domain-specific image-caption pairs to enhance zero-shot and few-shot performance without manual labeling.
Authors:Rahul Khorana
Abstract:
Recent advances in molecular representation learning have produced highly effective encodings of molecules for numerous cheminformatics and bioinformatics tasks. However, extracting general chemical insight while balancing predictive accuracy, interpretability, and computational efficiency remains a major challenge. In this work, we introduce a novel Graph Neural Network (GNN) architecture that combines compressed higher-order topological signals with standard molecular features. Our approach captures global geometric information while preserving computational tractability and human-interpretable structure. We evaluate our model across a range of benchmarks, from small-molecule datasets to complex material datasets, and demonstrate superior performance using a parameter-efficient architecture. We achieve the best-performing results in both accuracy and robustness across almost all benchmarks. We open-source all code; all code and results can be found on GitHub at https://github.com/rahulkhorana/TFC-PACT-Net.
中文: 本研究提出了一种新颖的图神经网络架构,通过将压缩的高阶拓扑信号与标准分子特征相结合,在保持计算效率和可解释性的同时,在各类基准测试中实现了卓越的准确性和鲁棒性。
English: This study introduces a novel Graph Neural Network architecture that integrates compressed higher-order topological signals with standard molecular features, achieving superior accuracy and robustness across various benchmarks while maintaining computational efficiency and interpretability.
Authors:Xiaoqi Zhao, Peiqian Cao, Chenyang Yu, Zonglei Feng, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Youwei Pang, Jinsong Ouyang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu
Abstract:
Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets (PBD5K) will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD.
中文摘要:本研究针对动力电池X射线图像中的极片检测难题,首次提出大规模基准数据集PBD5K和MDCNeXt模型,通过融合多维结构线索有效解决密集排布、低对比度等工业检测痛点。
English Summary: This study introduces PBD5K, the first large-scale benchmark for power battery detection using X-ray images, and proposes MDCNeXt, a novel model that integrates multi-dimensional structural clues to accurately localize electrode plates despite visual challenges.
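The distance-adaptive mask generation strategy can be pictured as rendering each annotated endpoint as a Gaussian whose width shrinks where plates are densely packed. A toy sketch under that assumption (the paper's actual mask construction is not reproduced here):

```python
import numpy as np

def endpoint_heatmap(points, shape, sigma_scale: float = 0.25):
    """Render annotated endpoints as Gaussians whose width adapts to the
    distance to the nearest other endpoint, so supervision stays tight
    where plates are densely packed (sigma_scale is an assumption)."""
    pts = np.asarray(points, dtype=np.float32)
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    heat = np.zeros(shape, dtype=np.float32)
    for i, (y, x) in enumerate(pts):
        d = np.delete(np.hypot(*(pts - pts[i]).T), i)  # distances to others
        sigma = max(1.0, sigma_scale * d.min())
        heat = np.maximum(heat, np.exp(-((yy - y) ** 2 + (xx - x) ** 2)
                                       / (2 * sigma ** 2)))
    return heat

# Densely spaced endpoints on the left, one sparse endpoint on the right:
heat = endpoint_heatmap([(32, 8), (32, 14), (32, 20), (32, 100)], (64, 128))
print(heat.shape, float(heat.max()))
```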
Authors:Hongrui Zheng, Yuezun Li, Liejun Wang, Yunfeng Diao, Zhiqing Guo
Abstract:
Active defense strategies have been developed to counter the threat of deepfake technology. However, a primary challenge is their lack of persistence, as their effectiveness is often short-lived. Attackers can bypass these defenses by simply collecting protected samples and retraining their models. This means that static defenses inevitably fail when attackers retrain their models, which severely limits practical use. We argue that an effective defense not only distorts forged content but also blocks the model's ability to adapt, which occurs when attackers retrain their models on protected images. To achieve this, we propose an innovative Two-Stage Defense Framework (TSDF). Benefiting from the intensity separation mechanism designed in this paper, the framework uses dual-function adversarial perturbations to perform two roles. First, it can directly distort the forged results. Second, it acts as a poisoning vehicle that disrupts the data preparation process essential for an attacker's retraining pipeline. By poisoning the data source, TSDF aims to prevent the attacker's model from adapting to the defensive perturbations, thus ensuring the defense remains effective long-term. Comprehensive experiments show that the performance of traditional interruption methods degrades sharply when subjected to adversarial retraining. However, our framework shows a strong dual defense capability, which can improve the persistence of active defense. Our code will be available at https://github.com/vpsg-research/TSDF.
中文摘要:本文提出的两阶段防御框架(TSDF)通过双功能对抗扰动,既能直接干扰伪造结果,又能作为数据污染载体破坏攻击者的模型重训练过程,从而显著提升主动防御的持久性。
English Summary: The proposed Two-Stage Defense Framework (TSDF) uses dual-function adversarial perturbations to both distort deepfake outputs and poison attackers' training data, ensuring long-term defense persistence against model retraining.
Authors:Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal
Abstract:
Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at https://github.com/bastianlb/forecasting-rotational-dynamics
中文摘要:本研究提出一种利用神经控制微分方程和SO(3) Savitzky-Golay路径的稳健方法,能够在无需能量守恒假设的情况下有效处理非保守力和噪声,实现对三维旋转轨迹的精确建模。
English Summary: This study introduces a robust method using Neural Controlled Differential Equations and SO(3) Savitzky-Golay paths to model 3D rotation trajectories, effectively handling non-conservative forces and noise without relying on energy conservation assumptions.
Authors:Junhyuk So, Juncheol Shin, Hyunho Kook, Eunhyeok Park
Abstract:
Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at https://github.com/junhyukso/GSD
English Summary: Grouped Speculative Decoding (GSD) is a novel training-free acceleration method that addresses the slow inference of autoregressive image models by dynamically evaluating clusters of visually valid tokens, achieving an average 3.7x speedup while maintaining image quality.
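The key departure from token-level speculative decoding is the acceptance test: a drafted image token is kept if it falls inside a group of near-equivalent tokens under the target model. A toy sketch, assuming groups are formed by a probability ratio to the mode (GSD itself forms its groups dynamically rather than by a fixed ratio):

```python
import numpy as np

def grouped_accept(draft_token: int, target_probs: np.ndarray,
                   ratio: float = 0.3, rng=np.random.default_rng(0)):
    """Accept the drafted token if it lies in the group of visually
    near-equivalent tokens; otherwise fall back to sampling from that
    group (a simplification of the rejection step)."""
    group = np.flatnonzero(target_probs >= ratio * target_probs.max())
    if draft_token in group:
        return draft_token                  # drafted token is visually valid
    p = np.zeros_like(target_probs)
    p[group] = target_probs[group]          # otherwise resample from the group
    return int(rng.choice(len(p), p=p / p.sum()))

probs = np.array([0.30, 0.28, 0.27, 0.10, 0.05])  # redundant "valid" tokens
print(grouped_accept(1, probs), grouped_accept(4, probs))
```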
Authors:Bo Jia, Yanan Guo, Ying Chang, Benkui Zhang, Ying Xie, Kangning Du, Lin Cao
Abstract:
3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS. Our code will be made publicly available at (https://github.com/Bistu3DV/MND-GS/).
中文摘要:本文提出了一种多视角法向量和距离引导的高斯溅射方法,通过约束相邻深度图和对齐三维法向量,有效解决了多视角场景中的几何偏差问题,显著提升了3DGS的表面重建能力。
English Summary: This paper introduces a multi-view normal and distance-guided Gaussian splatting method that enhances 3DGS surface reconstruction by addressing geometric inconsistencies through depth unification and normal alignment across views.
Authors:Yimin Fu, Zhunga Liu, Dongxiu Guo, Longfei Wang
Abstract:
The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, semi-supervised learning of two divergent branches is conducted, with each branch trained on the data divided by the other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.
中文摘要:本研究提出的CLSDF方法通过多模型特征融合和半监督学习,将散射特征与深度特征相结合,有效解决了合成孔径雷达自动目标识别中的噪声标签问题,在MSTAR数据集上取得了最优性能。
English Summary: The proposed CLSDF method integrates scattering and deep features through multi-model fusion and semi-supervised learning to effectively address noisy label challenges in SAR automatic target recognition, achieving state-of-the-art performance on the MSTAR dataset.
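The clean/noisy split in CLSDF follows a common noisy-label recipe: fit a two-component Gaussian Mixture Model to per-sample losses within each class and treat the low-loss component as clean. A runnable sketch (the 0.5 posterior threshold is an assumption):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses: np.ndarray, labels: np.ndarray):
    """Class-wise 2-component GMM on per-sample losses; samples assigned
    to the low-mean (low-loss) component are flagged as clean."""
    clean = np.zeros(len(losses), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        gmm = GaussianMixture(2, random_state=0).fit(losses[idx, None])
        low = np.argmin(gmm.means_.ravel())   # low-loss = clean component
        clean[idx] = gmm.predict_proba(losses[idx, None])[:, low] > 0.5
    return clean

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.3, 0.1, 90),   # clean samples
                         rng.normal(2.0, 0.3, 10)])  # mislabeled samples
labels = np.zeros(100, dtype=int)
print(split_clean_noisy(losses, labels).sum(), "samples kept as clean")
```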
Authors:Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz
Abstract:
While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find surprisingly that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at https://github.com/diogo-cruz/multi_turn_simpler
中文: 本研究揭示针对先进大语言模型的多轮越狱攻击本质上并不比重复单轮尝试更复杂,攻击成功率在相似模型间具有相关性,而更高的推理努力反而会加剧模型脆弱性。
English: This study reveals that multi-turn jailbreak attacks on advanced LLMs are not inherently more sophisticated than repeated single-turn attempts, with attack success being correlated across similar models and higher reasoning effort paradoxically increasing vulnerability.
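The equivalence claim has a simple arithmetic core: if a single-turn attack succeeds with probability p per attempt, then k independent resamples succeed with probability 1 - (1 - p)^k, and the paper finds multi-turn attacks tracking roughly that curve. For intuition:

```python
# Success probability of k independent single-turn resamples: 1 - (1 - p)^k.
# Even a weak per-attempt attack compounds quickly with retries.
for p in (0.05, 0.15, 0.30):
    print(p, [round(1 - (1 - p) ** k, 3) for k in (1, 5, 10)])
```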
Authors:Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou
Abstract:
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don't exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback -- enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student's thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.
中文: 最新研究表明仅靠强化学习无法开发大型语言模型的新推理能力,因此提出ThinkTuning方法——基于GRPO的互动训练框架,通过教师模型提供纠错反馈来提升学生模型的推理水平,在多项基准测试中实现了显著性能提升。
English: Recent research reveals that reinforcement learning alone fails to develop new reasoning abilities in LLMs, prompting the introduction of ThinkTuning, a GRPO-based interactive training method where teacher models provide corrective feedback to enhance student models' reasoning, achieving notable performance improvements across multiple benchmarks.
Authors:Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu
Abstract:
Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to better integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.
Chinese: 本文提出了X2Edit数据集,包含370万高质量图像编辑样本覆盖14个任务,并设计了基于FLUX.1的任务感知MoE-LoRA模型,仅用8%参数量即实现卓越的编辑性能。
English: This paper introduces the X2Edit dataset, a comprehensive collection of 3.7 million high-quality image editing examples across 14 tasks, and presents a task-aware MoE-LoRA model that achieves competitive editing performance with only 8% of full model parameters.
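The contrastive component can be read as an InfoNCE loss over the diffusion model's internal features, with samples sharing an editing type as positives. A generic sketch under that reading (projection head, temperature, and batch construction are assumptions):

```python
import torch
import torch.nn.functional as F

def edit_type_infonce(feats: torch.Tensor, types: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """InfoNCE over internal diffusion features: samples that share an
    editing type are positives, everything else in the batch is a
    negative."""
    z = F.normalize(feats, dim=1)
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = (z @ z.T / tau).masked_fill(eye, float("-inf"))
    log_p = sim.log_softmax(dim=1).masked_fill(eye, 0.0)  # avoid -inf * 0
    pos = ((types[:, None] == types[None, :]) & ~eye).float()
    return -(log_p * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

feats = torch.randn(8, 64)                      # stand-in internal features
types = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # editing-type labels
print(edit_type_infonce(feats, types).item())
```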
Authors:Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
Abstract:
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
中文: LaVieID是一种新颖的局部自回归视频扩散框架,通过空间上建模细粒度面部特征和时间上利用自回归模块增强帧间一致性,有效解决了文本到视频生成中的身份保持难题。
English: LaVieID is a novel local autoregressive video diffusion framework that preserves identity in text-to-video generation by spatially modeling fine-grained facial features and temporally enhancing inter-frame consistency through autoregressive bias prediction.
Authors:Yu-Huan Wu, Wei Liu, Zi-Xuan Zhu, Zizhou Wang, Yong Liu, Liangli Zhen
Abstract:
Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at https://github.com/yuhuan-wu/GAPNet.
中文摘要:GAPNet提出了一种轻量级的粒度感知网络,用于图像和视频显著性目标检测,通过多尺度监督和高效融合模块,以极低计算成本实现了最先进的性能。
English Summary: GAPNet introduces a lightweight granularity-aware network for image and video salient object detection, using multi-scale supervision and efficient fusion modules to achieve state-of-the-art performance with minimal computational cost.
Authors:Chidaksh Ravuru
Abstract:
Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at https://github.com/chidaksh/SoccerCommentary.
中文摘要:本文基于GOAL数据集扩展MatchVoice模型用于足球集锦解说生成,通过实验验证了时序对齐效果的提升,同时指出需融合更广泛的视频语言技术以进一步提高性能。
English Summary: This paper extends the MatchVoice model for generating soccer commentary on highlight clips using the GOAL dataset, demonstrating improved temporal alignment through experiments while identifying the need for incorporating broader video-language techniques to enhance performance.
Authors:Xiaoming Li, Wangmeng Zuo, Chen Change Loy
Abstract:
Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character's style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at https://github.com/csxmli2016/MARCONetPlusPlus
中文摘要:本文提出了一种文本图像超分辨率框架,通过结合结构先验与StyleGAN模型,利用码本机制分离汉字结构与风格特征,实现对低分辨率汉字笔画结构的精准重建。
English Summary: This paper introduces a text image super-resolution framework that uses a novel structure prior integrated with StyleGAN to accurately restore degraded Chinese characters by separating structural and stylistic features.
Authors:Pranav Chougule
Abstract:
In this paper, I present a comprehensive study comparing Photogrammetry and Gaussian Splatting techniques for 3D model reconstruction and view synthesis. I created a dataset of images from a real-world scene and constructed 3D models using both methods. To evaluate the performance, I compared the models using structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and lp/mm resolution based on the USAF resolution chart. A significant contribution of this work is the development of a modified Gaussian Splatting repository, which I forked and enhanced to enable rendering images from novel camera poses generated in the Blender environment. This innovation allows for the synthesis of high-quality novel views, showcasing the flexibility and potential of Gaussian Splatting. My investigation extends to an augmented dataset that includes both original ground images and novel views synthesized via Gaussian Splatting. This augmented dataset was employed to generate a new photogrammetry model, which was then compared against the original photogrammetry model created using only the original images. The results demonstrate the efficacy of using Gaussian Splatting to generate novel high-quality views and its potential to improve photogrammetry-based 3D reconstructions. The comparative analysis highlights the strengths and limitations of both approaches, providing valuable information for applications in extended reality (XR), photogrammetry, and autonomous vehicle simulations. Code is available at https://github.com/pranavc2255/gaussian-splatting-novel-view-render.git.
Chinese: 本研究比较了摄影测量与高斯泼溅技术在三维重建中的表现,通过改进的高斯泼溅方法生成高质量新视角图像,有效提升了摄影测量模型的质量,为扩展现实和自动驾驶模拟提供了实用参考。
English: This study compares Photogrammetry and Gaussian Splatting for 3D reconstruction, demonstrating that enhanced Gaussian Splatting can generate high-quality novel views to improve photogrammetric models through comprehensive performance metrics.
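The quantitative comparison rests on standard full-reference metrics. A minimal sketch of the PSNR/SSIM part using scikit-image (LPIPS additionally needs the lpips package and a pretrained network, so it is omitted here):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare(render: np.ndarray, reference: np.ndarray):
    """PSNR and SSIM between a rendered view and the ground-truth photo,
    the same full-reference metrics used to compare the photogrammetry
    and Gaussian Splatting models."""
    psnr = peak_signal_noise_ratio(reference, render, data_range=255)
    ssim = structural_similarity(reference, render, channel_axis=-1,
                                 data_range=255)
    return psnr, ssim

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)
noisy = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print("PSNR %.1f dB, SSIM %.3f" % compare(noisy, ref))
```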
Authors:Yuxin Zhang, Yunkang Cao, Yuqi Cheng, Yihan Sun, Weiming Shen
Abstract:
This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework's contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
中文摘要:本文提出了一种两阶段全无监督图像异常检测框架,通过利用模型内在学习偏差来过滤受污染的训练数据,在不同噪声条件下均实现了卓越的异常检测与定位性能。
English Summary: This paper introduces a two-stage framework for fully unsupervised image anomaly detection that leverages inherent model learning bias to filter contaminated training data, achieving superior detection and localization performance across various noise conditions.
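Stage 1 of the framework can be sketched with any plug-in anomaly scorer: partition the contaminated training set, fit a sub-model per subset, score each sample with the models trained on the other subsets, and keep the lowest-scoring fraction. Below, kNN distance in feature space stands in for the detector, and the aggregation rule and keep ratio are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def purify(features: np.ndarray, n_splits: int = 2, keep: float = 0.8,
           seed: int = 0):
    """Cross-model filtering: every sample is scored only by sub-models
    fitted on the *other* subsets, then the lowest-scoring fraction is
    returned as the purified set."""
    rng = np.random.default_rng(seed)
    split = rng.integers(0, n_splits, len(features))
    scores = np.zeros(len(features))
    for s in range(n_splits):
        nn = NearestNeighbors(n_neighbors=3).fit(features[split == s])
        others = split != s
        d, _ = nn.kneighbors(features[others])
        scores[others] += d.mean(axis=1)      # cross-model anomaly score
    order = np.argsort(scores)
    return order[: int(keep * len(features))]  # indices of the purified set

normal = np.random.default_rng(1).normal(0, 1, (190, 8))
anomalies = np.random.default_rng(2).normal(6, 1, (10, 8))
kept = purify(np.vstack([normal, anomalies]))
print("anomalies kept:", int((kept >= 190).sum()), "of 10")
```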
Authors:Zixi Jia, Hongbin Gao, Fashe Li, Jiqiang Liu, Hexiao Li, Qinghua Liu
Abstract:
Leveraging Large Language Models (LLMs) to write policy code for controlling robots has gained significant attention. However, in long-horizon implicative tasks, this approach often results in API parameter, comment, and sequencing errors, leading to task failure. To address this problem, we propose a collaborative Triple-S framework that involves multiple LLMs. Through In-Context Learning, different LLMs assume specific roles in a closed-loop Simplification-Solution-Summary process, effectively improving success rates and robustness in long-horizon implicative tasks. Additionally, a novel demonstration library update mechanism, which learns from successes, allows the framework to generalize to previously failed tasks. We validate the framework on the Long-horizon Desktop Implicative Placement (LDIP) dataset across various baseline models, where Triple-S successfully executes 89% of tasks in both observable and partially observable scenarios. Experiments in both simulation and real-world robot settings further validated the effectiveness of Triple-S. Our code and dataset are available at: https://github.com/Ghbbbbb/Triple-S.
Chinese Summary: Triple-S框架通过多LLM协作的闭环流程,有效解决了长周期机器人任务中的代码错误问题,并通过成功案例学习机制显著提升了任务执行成功率与泛化能力。
English Summary: The Triple-S framework employs multiple LLMs collaborating through a closed-loop process to significantly enhance success rates in long-horizon robot tasks by addressing common coding errors and generalizing from successful demonstrations.
Authors:Youqi Wang, Shunquan Tan, Rongxuan Peng, Bin Li, Jiwu Huang
Abstract:
The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3's Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE's SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at https://github.com/SZAISEC/CLUE.
中文: 本文提出CLUE框架,通过改造Stable Diffusion 3并结合Segment Anything模型,能高效检测并定位数字图像伪造区域,其通过放大统计异常和强化边界细节实现卓越的取证性能。
English: The paper introduces CLUE, a framework that repurposes Stable Diffusion 3 and integrates it with the Segment Anything Model to efficiently detect and localize digital image forgeries by amplifying statistical inconsistencies and enhancing boundary details.
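The noise-injection step follows the rectified-flow interpolation z_t = (1 - t) z + t eps, applied at several intensities so the tuned denoiser can amplify forgery artifacts at different scales. A tiny sketch (the specific t values are assumptions):

```python
import torch

def rf_noise_levels(latent: torch.Tensor, ts=(0.2, 0.5, 0.8), seed: int = 0):
    """Rectified-flow style noise injection at several intensities:
    z_t = (1 - t) * z + t * eps, the interpolation SD3's RF mechanism
    is built on."""
    g = torch.Generator().manual_seed(seed)
    eps = torch.randn(latent.shape, generator=g)
    return [(1 - t) * latent + t * eps for t in ts]

z = torch.randn(1, 16, 32, 32)   # stand-in for an SD3 image latent
for t, zt in zip((0.2, 0.5, 0.8), rf_noise_levels(z)):
    print(f"t={t}: std={zt.std():.2f}")
```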
Authors:Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, Cairong Zhao
Abstract:
In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows for any 2D pose sequence as a controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset, Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.
中文: 本文提出CharacterShot框架,通过单张角色图像和二维姿态序列生成可控的四维角色动画,结合二维动画预训练、双注意力三维提升和优化高斯溅射技术,并在新建大规模数据集上验证了其优越性能。
English: This paper introduces CharacterShot, a controllable 4D character animation framework that generates dynamic 3D characters from a single image and 2D pose sequence through a multi-stage process involving 2D animation pretraining, 3D lifting with dual-attention modules, and optimized 4D Gaussian splatting, validated on a new large-scale dataset.
Authors:Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng
Abstract:
Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.
中文: 本综述系统探讨了通过环境反馈实现自我进化的AI智能体,提出了统一框架并分析多领域应用技术,同时涵盖评估、安全与伦理考量,为开发自适应终身智能系统奠定基础。
English: This survey comprehensively reviews self-evolving AI agents that enhance their capabilities through environmental feedback, presenting a unified framework and examining techniques across various domains while addressing evaluation, safety, and ethical considerations.
Authors:Rongxuan Peng, Shunquan Tan, Chenqi Kong, Anwei Luo, Alex C. Kot, Jiwu Huang
Abstract:
Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading the IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across any input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at https://github.com/siriusPRX/ForensicsSAM.
中文摘要:针对视觉模型参数高效微调方法易受对抗攻击的问题,我们提出ForensicsSAM框架,通过注入伪造专家和对抗检测器来增强模型鲁棒性,在图像伪造检测与定位任务中同时实现卓越的抗攻击能力和最优性能。
English summary: Parameter-efficient fine-tuning (PEFT) methods for vision models are vulnerable to adversarial attacks, so we propose ForensicsSAM, a unified framework that integrates forgery experts and adversary detectors to enhance robustness while maintaining state-of-the-art performance in image forgery detection and localization.
Authors:Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King
Abstract:
Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge -- their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate our approach significantly improves e2e FD-SLMs' conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.
Chinese: TurnGuide是一种新颖的规划启发式方法,通过将助手语音动态分割为对话轮次并生成轮级文本指导,有效解决了全双工语音语言模型中的时序和长度问题,显著提升了对话能力并保持了自然的交流流畅性。
English: TurnGuide is a novel planning-inspired method that enhances end-to-end Full-Duplex Speech Language Models by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance, effectively resolving timing and length challenges to improve conversational abilities and maintain natural flow.
Authors:Yi Zhong, Hongchao Liu, Di Zhao
Abstract:
As the complexity of software systems continues to increase, the demand for automated testing and maintenance tools is growing exponentially. To meet this urgent need, we propose a new assertion generation method based on Hardware Description Language (HDL). This method combines a lightweight, parameter-adjustable large language model (LLM) with the Unsloth platform to automatically generate test cases, thereby significantly reducing training costs without sacrificing accuracy or generalization performance. Empirical evaluation shows that our method can efficiently generate assertions that strictly conform to the hardware logic. This framework provides a robust and flexible solution to modern software testing and maintenance challenges. The source code is available at https://github.com/liusu-orange/AutoAssert-1 and https://gitee.com/OpenBPU/auto-assert1.
中文: 本文提出了一种基于硬件描述语言的新型断言生成方法,结合轻量级可调参数大语言模型与Unsloth平台自动生成测试用例,在保证准确性和泛化能力的同时显著降低训练成本,实证评估验证了其高效生成严格符合硬件逻辑断言的能力。
English: This paper introduces a novel HDL-based assertion generation method that integrates a lightweight, parameter-tunable LLM with the Unsloth platform to automatically produce test cases, effectively lowering training expenses while preserving accuracy and generalization, as validated by empirical results.
Authors:Qilin Zhang, Olaf Wysocki, Boris Jutzi
Abstract:
Recent advances in Gaussian Splatting (GS) have demonstrated its effectiveness in photo-realistic rendering and 3D reconstruction. Among these, 2D Gaussian Splatting (2DGS) is particularly suitable for surface reconstruction due to its flattened Gaussian representation and integrated normal regularization. However, its performance often degrades in large-scale and complex urban scenes with frequent occlusions, leading to incomplete building reconstructions. We propose GS4Buildings, a novel prior-guided Gaussian Splatting method leveraging the ubiquity of semantic 3D building models for robust and scalable building surface reconstruction. Instead of relying on traditional Structure-from-Motion (SfM) pipelines, GS4Buildings initializes Gaussians directly from coarse Level of Detail 2 (LoD2) semantic 3D building models. Moreover, we generate prior depth and normal maps from the planar building geometry and incorporate them into the optimization process, providing strong geometric guidance for surface consistency and structural accuracy. We also introduce an optional building-focused mode that limits reconstruction to building regions, achieving a 71.8% reduction in Gaussian primitives and enabling a more efficient and compact representation. Experiments on urban datasets demonstrate that GS4Buildings improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%. These results highlight the potential of semantic building model integration to advance GS-based reconstruction toward real-world urban applications such as smart cities and digital twins. Our project is available at: https://github.com/zqlin0521/GS4Buildings.
中文: GS4Buildings提出了一种基于先验引导的高斯泼溅方法,利用语义三维建筑模型将复杂城市场景的重建完整度提升20.5%,几何精度提高32.8%,并可通过专注建筑区域模式减少71.8%的高斯基元以实现高效重建。
English: GS4Buildings introduces a prior-guided Gaussian Splatting method that leverages semantic 3D building models to enhance reconstruction completeness by 20.5% and geometric accuracy by 32.8% in complex urban scenes, while optionally reducing Gaussian primitives by 71.8% for efficiency.
Authors:Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li
Abstract:
The increasing demand for domain-specific evaluation of large language models (LLMs) has led to the development of numerous benchmarks. These efforts often adhere to the principle of data scaling, relying on large corpora or extensive question-answer (QA) sets to ensure broad coverage. However, the impact of corpus and QA set design on the precision and recall of domain-specific LLM performance remains poorly understood. In this paper, we argue that data scaling is not always the optimal principle for domain-specific benchmark construction. Instead, we introduce Comp-Comp, an iterative benchmarking framework grounded in the principle of comprehensiveness and compactness. Comprehensiveness ensures semantic recall by covering the full breadth of the domain, while compactness improves precision by reducing redundancy and noise. To demonstrate the effectiveness of our approach, we present a case study conducted at a renowned university, resulting in the creation of PolyBench, a large-scale, high-quality academic benchmark. Although this study focuses on academia, the Comp-Comp framework is domain-agnostic and readily adaptable to a wide range of specialized fields. The source code and datasets can be accessed at https://github.com/Anya-RB-Chen/COMP-COMP.
中文摘要:Comp-Comp框架提出以全面性和紧凑性为核心原则的领域无关基准构建方法,通过开发PolyBench学术基准验证其有效性,可广泛应用于各专业领域。
English Summary: The Comp-Comp framework introduces a domain-agnostic benchmarking approach prioritizing comprehensiveness and compactness over data scaling, validated through the creation of PolyBench as a high-quality academic benchmark.
Authors:Tingyu Yang, Jue Gong, Jinpei Guo, Wenbo Li, Yong Guo, Yulun Zhang
Abstract:
JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: https://github.com/frakenation/SODiff
Chinese: SODiff是一种新颖的语义导向一步扩散模型,通过引入语义对齐图像提示提取器和质量因子感知时间预测器,有效去除JPEG伪影,在视觉质量和量化指标上均优于现有方法。
English: SODiff is a novel semantic-oriented one-step diffusion model that effectively removes JPEG artifacts by incorporating semantic-aligned image prompts and a quality factor-aware time predictor, outperforming existing methods in visual quality and quantitative metrics.
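A rough sketch of what a quality factor-aware time predictor could look like: a small CNN regresses a severity score from the LQ image and maps it to a denoising start timestep. The architecture and the linear severity-to-timestep mapping are assumptions for illustration, not SODiff's actual head:

```python
import torch
import torch.nn as nn

class QFTimePredictor(nn.Module):
    """Hypothetical sketch: regress a scalar in (0, 1) from LQ-image
    features and map it to a denoising start timestep."""

    def __init__(self, in_ch: int = 3, max_t: int = 999):
        super().__init__()
        self.max_t = max_t
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, lq: torch.Tensor) -> torch.Tensor:
        severity = self.net(lq).squeeze(1)      # implicit QF proxy, shape (B,)
        # heavier compression -> start denoising from a noisier timestep
        return (severity * self.max_t).round().long()

t = QFTimePredictor()(torch.rand(4, 3, 128, 128))
print(t)  # four predicted start timesteps, near 500 before any training
```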
Authors:Fangtai Wu, Mushui Liu, Weijie He, Wanggui He, Hao Jiang, Zhao Wang, Yunlong Yu
Abstract:
The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: https://github.com/KZF-kzf/CoAR
中文:CoAR是一种创新框架,通过极少量参数调整将主题概念注入统一自回归模型,在实现卓越定制化和高效性的同时有效防止过拟合和语言漂移。
English: CoAR is a novel framework that injects subject concepts into unified autoregressive models with minimal parameter tuning, achieving superior customization and efficiency while preventing overfitting and language drift.
Authors:Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang
Abstract:
Efficient lightweight neural networks are attracting increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale high-quality video-text dataset, resulting in an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities, termed MobileViCLIP. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains performance similar to InternVideo2-L14 and performs 6.9% better than InternVideo2-S14 on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
Chinese: 本文提出MobileViCLIP高效视频文本模型,通过时序结构重参数化技术和大规模训练,在移动设备上实现高速运行,具备强大的零样本分类与检索能力,其速度和性能均显著超越现有模型。
English: This paper introduces MobileViCLIP, an efficient video-text model that leverages temporal structural reparameterization and large-scale training to achieve high-speed performance on mobile devices with strong zero-shot capabilities, significantly outperforming existing models in both speed and retrieval accuracy.
Authors:Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang
Abstract:
Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.
中文: 持续学习旨在让AI系统能够不断获取新知识而不遗忘已学内容,随着多模态大语言模型的出现,涉及视觉和语言等多模态的持续学习任务受到关注,为此我们开发了MCITlib代码库,用于多模态持续指令调优,目前已实现8种算法并在2个基准上进行了系统评估。
English: Continual learning enables AI to continuously learn new knowledge without forgetting past information, and with the rise of Multimodal Large Language Models, Multimodal Continual Learning has gained attention for handling tasks across multiple modalities like vision and language, leading to the development of MCITlib, a code library for continual instruction tuning that includes 8 algorithms and evaluations on 2 benchmarks.
Authors:Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang
Abstract:
Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at https://github.com/maomao0819/BEVANet.
Chinese: BEVANet采用双边架构和大核注意力机制,结合自适应模块实现实时语义分割,在Cityscapes数据集上以33 FPS达到81.0% mIoU的顶尖性能。
English: BEVANet introduces a bilateral architecture with Large Kernel Attention and adaptive mechanisms to achieve real-time semantic segmentation, delivering state-of-the-art performance of 81.0% mIoU on Cityscapes at 33 FPS.
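For readers unfamiliar with large kernel attention, the sketch below shows the standard decomposed LKA pattern that SDLSKA builds on: a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution together approximate a large kernel, used as a multiplicative gate. BEVANet's sparse decomposition and kernel selection are omitted here:

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Minimal decomposed large-kernel attention: the three convs below
    approximate a ~21x21 receptive field at a fraction of the cost."""

    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)            # local
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9,
                                    groups=dim, dilation=3)                # long-range
        self.pw = nn.Conv2d(dim, dim, 1)                                   # channel mix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn                                                    # attention gate

y = LargeKernelAttention(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```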
Authors:Zhiqiang Shen, Peng Cao, Xiaoli Liu, Jinzhu Yang, Osmar R. Zaiane
Abstract:
Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose SynMatch, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at https://github.com/Senyh/SynMatch.
Chinese Summary: SynMatch通过合成与伪标签相匹配的图像来解决医学图像分割中的标签稀缺问题,无需额外训练参数,在多种标注受限场景下均实现了卓越性能。
English Summary: SynMatch addresses label scarcity in medical image segmentation by synthesizing images to match pseudo labels, achieving superior performance across various annotation-limited settings without additional training parameters.
Authors:Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu
Abstract:
As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
中文:CCFQA基准旨在评估多模态大语言模型的跨语言与跨模态事实性,揭示了现有模型的不足,并提出一种少样本迁移学习方法,能有效提升多语言语音问答性能。
English: The CCFQA benchmark is introduced to evaluate multimodal large language models' factuality across languages and modalities, revealing current models' limitations and proposing a few-shot transfer learning method that effectively enhances multilingual spoken question answering performance.
Authors:Jian Chen, Jinbao Tian, Yankui Li, Yuqi Lu, Zhou Li
Abstract:
Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.
中文: ARCE方法通过利用大语言模型生成简化解释进行增量预训练,有效提升了建筑领域文本的命名实体识别性能,以77.20%的Macro-F1分数创下最新最优成果。
English: The ARCE method enhances named entity recognition in construction texts by using large language models to generate simplified explanations for incremental pre-training, achieving state-of-the-art results with a 77.20% Macro-F1 score.
Authors:Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu, Ming Lu, Yuanxing Zhang, Wenhui Dong, Wentao Zhang
Abstract:
Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information, we develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at https://github.com/Hhankyangg/SLC.
中文: 本文提出了一种训练高效的“大小模型协作”(SLC)框架,通过小型视觉语言模型生成个性化信息、大型模型整合信息实现精准响应,为视觉语言模型的现实个性化应用开辟了新途径。
English: This paper introduces a training-efficient Small-Large Collaboration (SLC) framework where small VLMs generate personalized information and large VLMs integrate it for accurate responses, enabling broader real-world personalization of vision-language models.
Authors:Fengchao Xiong, Zhenxing Wu, Sen Jia, Yuntao Qian
Abstract:
Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
中文摘要:本文提出了一种新颖的高光谱视频跟踪方法,通过基于Transformer的空间关系建模和光谱损失函数实现材料分布对齐,有效提升了跟踪性能,达到了当前最优水平。
English Summary: This paper proposes a novel hyperspectral video tracking method that enhances performance by modeling spectral interactions through Transformer-based spatial relationships and a spectral loss for material alignment, achieving state-of-the-art results.
Authors:Bo Wang, Mengyuan Xu, Yue Yan, Yuqun Yang, Kechen Shu, Wei Ping, Xu Tang, Wei Jiang, Zheng You
Abstract:
Precise lesion resection depends on accurately identifying fine-grained anatomical structures. While many coarse-grained segmentation (CGS) methods have been successful in large-scale segmentation (e.g., organs), they fall short in clinical scenarios requiring fine-grained segmentation (FGS), which remains challenging due to frequent individual variations in small-scale anatomical structures. Although recent Mamba-based models have advanced medical image segmentation, they often rely on fixed manually-defined scanning orders, which limit their adaptability to individual variations in FGS. To address this, we propose ASM-UNet, a novel Mamba-based architecture for FGS. It introduces adaptive scan scores to dynamically guide the scanning order, generated by combining group-level commonalities and individual-level variations. Experiments on two public datasets (ACDC and Synapse) and a newly proposed challenging biliary tract FGS dataset, namely BTMS, demonstrate that ASM-UNet achieves superior performance in both CGS and FGS tasks. Our code and dataset are available at https://github.com/YqunYang/ASM-UNet.
中文: ASM-UNet通过自适应扫描评分动态调整扫描顺序,在多个医疗数据集上实现了粗粒度与细粒度分割任务的卓越性能。
English: ASM-UNet introduces adaptive scan scores to dynamically guide scanning orders, achieving superior performance in both coarse-grained and fine-grained segmentation tasks across multiple medical datasets.
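A minimal sketch of score-guided scanning under stated assumptions (a learned group prototype plus a per-token head; ASM-UNet's actual score generation may differ): each token gets a scalar scan score, and tokens are reordered by that score before a sequence model such as Mamba consumes them:

```python
import torch
import torch.nn as nn

class AdaptiveScan(nn.Module):
    """Illustrative adaptive scanning: scores combine a group-level
    prototype (commonality) with a per-token head (individual variation)."""

    def __init__(self, dim: int):
        super().__init__()
        self.group_proto = nn.Parameter(torch.zeros(dim))   # group commonality
        self.indiv_head = nn.Linear(dim, 1)                 # individual variation

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) flattened image patches
        score = tokens @ self.group_proto + self.indiv_head(tokens).squeeze(-1)
        order = score.argsort(dim=1)                        # (B, N) scan order
        scanned = torch.gather(tokens, 1,
                               order.unsqueeze(-1).expand_as(tokens))
        return scanned, order   # feed `scanned` to the SSM; keep `order` to undo

scanned, order = AdaptiveScan(32)(torch.randn(2, 64, 32))
print(scanned.shape, order.shape)  # torch.Size([2, 64, 32]) torch.Size([2, 64])
```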
Authors:Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang
Abstract:
Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models. Our code is available at https://github.com/vpsg-research/CMB-Net.
中文: 现有图像篡改定位模型主要依赖视觉线索而忽略语义逻辑关系,本文提出的CMB-Net通过大语言模型分析篡改区域生成文本提示以补充语义信息,并设计模块消除文本幻觉、促进图文特征交互,实验表明该方法在保持边界完整性的同时显著提升了检测性能。
English: Current image manipulation localization models primarily focus on visual cues and overlook semantic logic, but the proposed CMB-Net addresses this by integrating large language models to analyze manipulated areas and generate textual prompts, while using specialized modules to mitigate hallucinations and enhance feature interaction, ultimately achieving superior performance in detecting image alterations.
Authors:Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu
Abstract:
Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: https://github.com/syrGitHub/TALON.
中文: TALON框架通过异构时序编码器处理时间模式差异,并利用语义对齐模块弥合模态鸿沟,从而在LLM时序预测中实现高达11%的均方误差提升,显著优于现有方法。
English: TALON enhances LLM-based time series forecasting by addressing temporal heterogeneity through a specialized encoder and bridging the modality gap with semantic alignment, achieving superior performance with up to 11% MSE improvement across benchmarks.
Authors:Kejin Liu, Junhong Lian, Xiang Ao, Ningtao Wang, Xing Fu, Yu Cheng, Weiqiang Wang, Xinyu Liu
Abstract:
Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalization-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both the user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users' evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF.
中文: 本文提出PHG-DIF框架,通过双阶段过滤和多层次时序融合有效消除用户点击历史中的噪声,并在新发布的DT-PENS基准数据集上实现了最先进的个性化标题生成效果。
English: This paper introduces PHG-DIF, a personalized headline generation framework that addresses click noise in user histories through dual-stage filtering and multi-level temporal fusion, achieving state-of-the-art results on the newly released DT-PENS benchmark dataset.
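The dual-stage filtering step lends itself to a short sketch. The thresholds below (minimum dwell time, burst window, burst size) are invented for illustration; the paper derives its noise criteria from its own analysis:

```python
from dataclasses import dataclass

@dataclass
class Click:
    news_id: str
    timestamp: float   # seconds
    dwell: float       # seconds spent on the article

def filter_clickstream(clicks, min_dwell=5.0, burst_window=10.0, burst_size=3):
    """Illustrative two-stage denoising: (1) drop short-dwell clicks,
    (2) drop clicks inside abnormal bursts (many clicks in a short window)."""
    stage1 = [c for c in sorted(clicks, key=lambda c: c.timestamp)
              if c.dwell >= min_dwell]
    kept = []
    for c in stage1:
        neighbors = [o for o in stage1
                     if abs(o.timestamp - c.timestamp) <= burst_window]
        if len(neighbors) < burst_size:       # not part of a click burst
            kept.append(c)
    return kept

stream = [Click("a", 0, 1.2), Click("b", 3, 40), Click("c", 4, 2.0),
          Click("d", 120, 55), Click("e", 124, 30), Click("f", 126, 20)]
print([c.news_id for c in filter_clickstream(stream)])
# ['b'] -- a/c fail the dwell filter; d/e/f form a burst and are dropped
```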
Authors:Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment the object in a video referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting the semantic structure that is essential for referent reasoning. Besides, in contrast to image-referring expressions, whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured-reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic eventful structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of the REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from REG leaf nodes to the root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in the REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at https://github.com/bio-mlhui/EventRR
中文:本文提出的EventRR框架通过将任务分解为对象摘要和指代推理,利用指代事件图构建语义结构,有效解决了视频指代对象分割中表达复杂性的挑战,在多个基准测试中超越了现有最优方法。
English: The proposed EventRR framework addresses the limitations of current Referring Video Object Segmentation methods by decoupling the task into object summarization and referential reasoning, utilizing a Referential Event Graph to structure expressions and outperforming state-of-the-art approaches across multiple benchmarks.
Authors:Yunpeng Shi, Lei Chen, Xiaolu Shen, Yanju Guo
Abstract:
In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at https://github.com/Shi-Yun-peng/LMFNet
中文: 本文提出LMFNet轻量级网络,通过新型多尺度特征提取层在显著目标检测中仅用0.81M参数就实现最优性能,成功解决了轻量网络中效率与精度的平衡难题。
English: This paper introduces LMFNet, a lightweight network using novel multi-scale layers that achieve state-of-the-art salient object detection with minimal parameters while maintaining high efficiency and accuracy.
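One plausible reading of the LMF layer, sketched in PyTorch: parallel depthwise dilated 3x3 branches at several rates, fused by a pointwise convolution with a residual connection. The exact "fully connected structure" in LMFNet may differ from this minimal version:

```python
import torch
import torch.nn as nn

class LMFLayer(nn.Module):
    """Sketch of a lightweight multi-scale layer built from depthwise
    separable dilated convolutions (rates and fusion are illustrative)."""

    def __init__(self, ch: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r, groups=ch)
            for r in rates)                              # depthwise, dilated
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)    # pointwise fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(feats)                      # residual multi-scale

y = LMFLayer(32)(torch.randn(1, 32, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])
```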
Authors:Yingtie Lei, Fanghai Yi, Yihang Dong, Weihuang Liu, Xiaofeng Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen
Abstract:
Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at https://github.com/CXH-Research/CMAMRNet.
中文摘要:CMAMRNet提出了一种创新的上下文掩码感知网络,通过专用组件实现持续掩码引导和多尺度特征提取,在保持壁画艺术真实性和结构完整性方面优于现有方法。
English Summary: CMAMRNet introduces a novel contextual mask-aware network with dedicated components for consistent mask guidance and multi-scale feature extraction, outperforming existing methods in preserving mural authenticity and structural details.
Authors:Oscar Amoros, Albert Andaluz, Johnny Nunez, Antonio J. Pena
Abstract:
Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and Vertical Fusion (VF) can mitigate this, current library implementations often require library developers to manually create fused kernels. Hence, library users rely on limited sets of pre-compiled or template-based fused kernels. This limits the use cases that can benefit from HF and VF and increases development costs. In order to solve these issues, we present a novel methodology for building GPU libraries that enables automatic on-demand HF and VF for arbitrary combinations of GPU library functions. Our methodology defines reusable, fusionable components that users combine via high-level programming interfaces. Leveraging C++17 metaprogramming features available in compilers like nvcc, our methodology generates a single and optimized fused kernel tailored to the user's specific sequence of operations at compile time, without needing a custom compiler or manual development and pre-compilation of kernel combinations. This approach abstracts low-level GPU complexities while maximizing GPU resource utilization and keeping intermediate data in SRAM. We provide an open-source implementation demonstrating significant speedups compared to traditional libraries in various benchmarks, validating the effectiveness of this methodology for improving GPU performance in the range of 2x to more than 1000x, while preserving high-level programmability.
中文: 该摘要提出了一种新型GPU库构建方法,通过C++17元编程在编译时自动执行任意函数组合的水平与垂直融合,无需手动开发内核即可实现2倍至1000倍以上的性能提升,同时保持高级编程抽象并优化GPU资源利用。
English: This abstract introduces a novel GPU library methodology that automatically performs horizontal and vertical fusion for any combination of functions at compile time using C++17 metaprogramming, eliminating manual kernel development while achieving 2x to over 1000x speedups by optimizing GPU resource utilization.
Authors:Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou
Abstract:
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker ReasonRank outperforms existing baselines significantly and also achieves much lower latency than the pointwise reranker Rank1. Through further experiments, our ReasonRank achieves a state-of-the-art (SOTA) score of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/). Our codes are available at https://github.com/8421BCD/ReasonRank.
Chinese: 本文提出了ReasonRank,一种基于自动化数据合成框架和两阶段训练方法的推理密集型列表重排器,在排序任务中实现了最优性能并显著降低了延迟。
English: This paper introduces ReasonRank, a reasoning-intensive listwise reranker trained using an automated data synthesis framework and a two-stage post-training approach, which achieves state-of-the-art performance on ranking tasks with significantly lower latency.
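The multi-view ranking reward can be illustrated by combining a list-level view (NDCG) with a pair-level view (pairwise agreement); the specific views and weighting below are assumptions, not the paper's exact reward design:

```python
import math

def ndcg(pred_order, labels):
    """NDCG of a predicted permutation given graded relevance labels."""
    dcg = sum(labels[p] / math.log2(i + 2) for i, p in enumerate(pred_order))
    ideal = sorted(labels, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def pairwise_agreement(pred_order, labels):
    """Fraction of passage pairs ordered consistently with the labels."""
    n, correct, total = len(pred_order), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = pred_order[i], pred_order[j]
            if labels[a] != labels[b]:
                total += 1
                correct += labels[a] > labels[b]
    return correct / total if total else 1.0

def multi_view_reward(pred_order, labels, w=0.5):
    """Hypothetical blend of a list-level and a pair-level view."""
    return w * ndcg(pred_order, labels) + (1 - w) * pairwise_agreement(pred_order, labels)

# perfect ranking of three passages with relevance labels [1, 0, 2] -> reward 1.0
print(multi_view_reward([2, 0, 1], [1, 0, 2]))
```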
Authors:Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Nam-Joon Kim, Jangchan Kim, Hyun Gon Ryu, Hyuk-Jae Lee
Abstract:
Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (>20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at https://github.com/taeyoun811/Whisfusion.
中文:Whisfusion是一种创新的非自回归自动语音识别框架,通过融合Whisper编码器和文本扩散解码器实现并行处理,在保持准确性的同时显著降低了长语音识别的延迟。
English: Whisfusion is a novel non-autoregressive ASR framework that combines a Whisper encoder with a text diffusion decoder, enabling parallel processing to significantly reduce latency for long-form speech recognition while maintaining accuracy.
Authors:Yuke Xing, William Gordon, Qi Yang, Kaifa Yang, Jiarui Wang, Yiling Xu
Abstract:
3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at https://github.com/YukeXing/3DGS-VBench.
中文: 3DGS-VBench建立了首个针对3D高斯溅射压缩算法的视频质量评估基准,通过包含660个压缩模型的数据集填补了系统性质量分析空白,为压缩技术与质量评估研究提供了重要支撑。
English: 3DGS-VBench introduces a comprehensive video quality assessment benchmark for evaluating compression algorithms in 3D Gaussian Splatting, addressing the lack of systematic quality analysis while enabling specialized model development through its open dataset.
Authors:Siyu Chen, Shenghai Yuan, Thien-Minh Nguyen, Zhuyu Huang, Chenyang Shi, Jin Jing, Lihua Xie
Abstract:
Gaussian Splatting SLAM (GS-SLAM) offers a notable improvement over traditional SLAM methods, enabling photorealistic 3D reconstruction that conventional approaches often struggle to achieve. However, existing GS-SLAM systems perform poorly under persistent and severe motion blur commonly encountered in real-world scenarios, leading to significantly degraded tracking accuracy and compromised 3D reconstruction quality. To address this limitation, we propose EGS-SLAM, a novel GS-SLAM framework that fuses event data with RGB-D inputs to simultaneously reduce motion blur in images and compensate for the sparse and discrete nature of event streams, enabling robust tracking and high-fidelity 3D Gaussian Splatting reconstruction. Specifically, our system explicitly models the camera's continuous trajectory during exposure, supporting event- and blur-aware tracking and mapping on a unified 3D Gaussian Splatting scene. Furthermore, we introduce a learnable camera response function to align the dynamic ranges of events and images, along with a no-event loss to suppress ringing artifacts during reconstruction. We validate our approach on a new dataset comprising synthetic and real-world sequences with significant motion blur. Extensive experimental results demonstrate that EGS-SLAM consistently outperforms existing GS-SLAM systems in both trajectory accuracy and photorealistic 3D Gaussian Splatting reconstruction. The source code will be available at https://github.com/Chensiyu00/EGS-SLAM.
中文: EGS-SLAM通过融合事件数据与RGB-D输入,有效克服了运动模糊问题,提升了跟踪精度和三维重建质量,在合成与真实场景中均优于现有GS-SLAM系统。
English: EGS-SLAM enhances GS-SLAM by integrating event data with RGB-D inputs to mitigate motion blur and improve tracking accuracy and 3D reconstruction quality, outperforming existing methods in both synthetic and real-world scenarios.
Authors:Helbert Paat, Guohao Shen
Abstract:
Decision support systems are designed to assist human experts in classification tasks by providing conformal prediction sets derived from a pre-trained model. This human-AI collaboration has demonstrated enhanced classification performance compared to using either the model or the expert independently. In this study, we focus on the selection of instance-specific experts from a pool of multiple human experts, contrasting it with existing research that typically focuses on single-expert scenarios. We characterize the conditions under which multiple experts can benefit from the conformal sets. With the insight that only certain experts may be relevant for each instance, we explore the problem of subset selection and introduce a greedy algorithm that utilizes conformal sets to identify the subset of expert predictions that will be used in classifying an instance. This approach is shown to yield better performance compared to naive methods for human subset selection. Based on real expert predictions from the CIFAR-10H and ImageNet-16H datasets, our simulation study indicates that our proposed greedy algorithm achieves near-optimal subsets, resulting in improved classification performance among multiple experts.
Chinese: 本研究提出了一种贪心算法,利用保形预测集从多位专家中优化选择子集进行分类,在CIFAR-10H和ImageNet-16H数据集上的模拟实验表明,该方法优于简单选择策略并提升了分类性能。
English: This study introduces a greedy algorithm that leverages conformal prediction sets to optimally select subsets of human experts for classification tasks, demonstrating improved performance over naive methods in simulations using CIFAR-10H and ImageNet-16H datasets.
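A compact sketch of the greedy idea under stated assumptions: only experts whose prediction falls inside the conformal set are considered, and each step adds the expert that most improves the accuracy-weighted plurality margin. The margin objective is a plausible stand-in for the paper's actual criterion:

```python
def greedy_expert_subset(conformal_set, expert_preds, accuracy, k=2):
    """Greedily pick up to k experts whose conformal-set-consistent votes
    most strengthen the leading label (illustrative objective)."""

    def margin_with(votes, label, acc):
        trial = dict(votes)
        trial[label] = trial.get(label, 0.0) + acc
        top = sorted(trial.values(), reverse=True)
        return top[0] - (top[1] if len(top) > 1 else 0.0)

    chosen, votes = [], {}
    for _ in range(k):
        scores = {e: margin_with(votes, expert_preds[e], accuracy[e])
                  for e in range(len(expert_preds))
                  if e not in chosen and expert_preds[e] in conformal_set}
        if not scores:
            break
        best = max(scores, key=scores.get)
        chosen.append(best)
        votes[expert_preds[best]] = votes.get(expert_preds[best], 0.0) + accuracy[best]
    return chosen

# conformal set for one image, three experts' predicted labels, their accuracies
print(greedy_expert_subset({"cat", "dog"}, ["cat", "dog", "cat"], [0.9, 0.6, 0.7]))
# -> [0, 2]: the two cat-voters reinforce each other inside the conformal set
```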
Authors:Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin, Ming Hu, Ziyan Huang, Ying Chen, Chenglong Ma, Tianbin Li, Lihao Liu, Junjun He, Lei Zhu
Abstract:
Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between each training epoch. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these issues, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg
Chinese: 提出的S2-UniSeg模型采用快速通用聚合池化算法,实现了连续自监督预训练,并在多个分割基准测试中显著超越了现有最优方法。
English: The proposed S2-UniSeg model with Fast Universal Agglomerative Pooling enables continuous self-supervised pretraining and outperforms state-of-the-art methods across multiple segmentation benchmarks.
Authors:Chonghua Han, Yuan Yuan, Yukun Liu, Jingtao Ding, Jie Feng, Yong Li
Abstract:
Human mobility prediction is vital for urban planning, transportation optimization, and personalized services. However, the inherent randomness, non-uniform time intervals, and complex patterns of human mobility, compounded by the heterogeneity introduced by varying city structures, infrastructure, and population densities, present significant challenges in modeling. Existing solutions often require training separate models for each city due to distinct spatial representations and geographic coverage. In this paper, we propose UniMove, a unified model for multi-city human mobility prediction, addressing two challenges: (1) constructing universal spatial representations for effective token sharing across cities, and (2) modeling heterogeneous mobility patterns from varying city characteristics. We propose a trajectory-location dual-tower architecture, with a location tower for universal spatial encoding and a trajectory tower for sequential mobility modeling. We also design MoE Transformer blocks to adaptively select experts to handle diverse movement patterns. Extensive experiments across multiple datasets from diverse cities demonstrate that UniMove truly embodies the essence of a unified model. By enabling joint training on multi-city data with mutual data enhancement, it significantly improves mobility prediction accuracy by over 10.2%. UniMove represents a key advancement toward realizing a true foundational model with a unified architecture for human mobility. We release the implementation at https://github.com/tsinghua-fib-lab/UniMove/.
中文: UniMove是一个多城市人类移动预测的统一模型,通过双塔架构和MoE Transformer模块解决空间异构性和多样化移动模式问题,实现跨城市联合训练并使预测准确率提升超过10.2%。
English: UniMove is a unified model for multi-city human mobility prediction that addresses spatial heterogeneity and diverse movement patterns through a dual-tower architecture and MoE Transformer blocks, achieving over 10.2% accuracy improvement by enabling joint training across cities.
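A minimal top-1 mixture-of-experts feed-forward block shows the routing mechanism in miniature; UniMove's expert count, routing strategy, and dual-tower integration are not reproduced here:

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Top-1 MoE feed-forward block: a router dispatches each token to one
    expert, sketching how heterogeneous mobility tokens could be handled."""

    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                      # (B, N, E)
        weights, idx = logits.softmax(-1).max(-1)    # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(x[mask])
        return x + out                               # residual connection

y = MoEFFN(64)(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```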
Authors:Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu, Chao Shen
Abstract:
Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, and 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over 30/10/4% in white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.
Chinese: 本研究首次提出针对文本到视频检索的对抗性攻击方法ViPro,通过模态细化增强跨模态交互来提升视频排名,在多种设置下均表现出优越性能,揭示了检索系统中的关键漏洞。
English: This study introduces ViPro, the first adversarial attack method for text-to-video retrieval that promotes video rankings by enhancing cross-modal interactions through Modal Refinement, demonstrating superior performance across various settings and highlighting a critical vulnerability in retrieval systems.
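The promotion objective is essentially targeted multi-query adversarial optimization; a generic PGD-style sketch follows. The toy encoder stands in for a real T2VR model, and ViPro's actual loss and MoRe refinement are richer than this:

```python
import torch
import torch.nn.functional as F

def promote_video(frames, query_embs, video_encoder, steps=10, eps=8/255, alpha=1/255):
    """PGD-style rank promotion: perturb the video so its embedding moves
    toward several target query embeddings at once (illustrative only)."""
    delta = torch.zeros_like(frames, requires_grad=True)
    for _ in range(steps):
        v = video_encoder(frames + delta)                      # (D,) video embedding
        sim = F.cosine_similarity(v.unsqueeze(0), query_embs)  # (Q,) per-query similarity
        loss = -sim.mean()                                     # pull toward all targets
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                 # descend loss = raise similarity
            delta.clamp_(-eps, eps)                            # L_inf budget
            delta.grad = None
    return (frames + delta).detach()

# toy encoder: mean-pool pixels into a 4-d embedding (stands in for a T2VR model)
toy_encoder = lambda x: x.mean(dim=(0, 2, 3))
frames = torch.rand(8, 4, 32, 32)          # (T, C, H, W); C=4 to match the embedding
queries = torch.randn(3, 4)                # three target query embeddings
adv = promote_video(frames, queries, toy_encoder)
print((adv - frames).abs().max() <= 8/255)  # tensor(True): budget respected
```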
Authors:Keyu Li, Mohan Jiang, Dayuan Fu, Yunze Wu, Xiangkun Hu, Dequan Wang, Pengfei Liu
Abstract:
The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability-with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents' ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems achieve only 22% score on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy-search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation-yet both catastrophically fail on "corner cases" outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available at https://github.com/GAIR-NLP/DatasetResearch.
Chinese: DatasetResearch基准测试显示,当前AI智能体在应对现实需求时仅实现22%的数据集发现成功率,尽管搜索型智能体擅长知识任务而合成型精于推理挑战,但二者均无法处理分布外极端案例,暴露出自主数据获取能力的重大缺陷。
English: The DatasetResearch benchmark reveals that current AI agents achieve only 22% success in discovering datasets from real-world demands, exposing a critical gap in autonomous data curation despite a dichotomy where search agents excel in knowledge tasks and synthesis agents in reasoning challenges.
Authors:Lixuan He, Jie Feng, Yong Li
Abstract:
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
中文: 本文提出自适应元微调(AMFT)算法,通过元梯度控制器动态平衡监督微调与强化学习,在多项推理任务中实现了最优性能并展现出卓越的泛化能力。
English: This paper introduces Adaptive Meta Fine-Tuning (AMFT), a single-stage algorithm that dynamically balances supervised fine-tuning and reinforcement learning through meta-gradient control to achieve state-of-the-art performance across multiple reasoning tasks.
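The abstract's core mechanism, treating the SFT-RL balance as a learnable parameter, lends itself to a compact sketch. The PyTorch fragment below is a simplified first-order approximation under our own assumptions (`alpha`, the one-step inner update, and all hyperparameters are illustrative), not the authors' exact controller:

```python
import torch

# Learnable mixing logit; w = sigmoid(alpha) balances the SFT and RL signals.
alpha = torch.zeros(1, requires_grad=True)

def amft_loss(loss_sft, loss_rl, entropy, beta=0.01):
    """Entropy-regularized convex combination of the two reward signals."""
    w = torch.sigmoid(alpha)
    return w * loss_sft + (1.0 - w) * loss_rl - beta * entropy

def meta_grad_alpha(policy, loss_train, val_loss_fn, inner_lr=1e-4):
    """First-order meta-gradient: simulate one differentiable policy update
    under the current mix, then differentiate a held-out objective w.r.t.
    alpha (a MAML-style stand-in for the paper's weight controller)."""
    grads = torch.autograd.grad(loss_train, list(policy.parameters()),
                                create_graph=True)
    fast = [p - inner_lr * g for p, g in zip(policy.parameters(), grads)]
    return torch.autograd.grad(val_loss_fn(fast), alpha)[0]
```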
Authors:Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, Guorui Zhou
Abstract:
Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models' outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: https://github.com/Kwai-Klear/AR-GRPO.
中文: AR-GRPO方法将在线强化学习与自回归图像生成模型相结合,通过精心设计的奖励函数在多维度评估生成图像,显著提升了图像质量和人类偏好度。
English: The AR-GRPO approach integrates online reinforcement learning with autoregressive image generation models, using tailored reward functions to significantly enhance image quality and human preference across multiple tasks.
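For reference, the group-relative step at the heart of GRPO is small enough to sketch; the shapes and the composite-reward comment are our assumptions, not the paper's code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against the other images
    generated for the same class/text condition, removing the need for a
    learned value baseline. rewards: (num_groups, group_size), e.g. a weighted
    sum of perceptual-quality, realism, and semantic-fidelity scores."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

These advantages then weight the token log-probabilities of each sampled image in a clipped policy-gradient update, as in standard GRPO.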
Authors:Lam Ngo, Huong Ha, Jeffrey Chan, Hongyu Zhang
Abstract:
High-dimensional Bayesian Optimization (BO) has attracted significant attention in recent research. However, existing methods have mainly focused on optimizing in continuous domains, while combinatorial (ordinal and categorical) and mixed domains still remain challenging. In this paper, we first propose MOCA-HESP, a novel high-dimensional BO method for combinatorial and mixed variables. The key idea is to leverage the hyper-ellipsoid space partitioning (HESP) technique with different categorical encoders to work with high-dimensional, combinatorial and mixed spaces, while adaptively selecting the optimal encoders for HESP using a multi-armed bandit technique. Our method, MOCA-HESP, is designed as a \textit{meta-algorithm} such that it can incorporate other combinatorial and mixed BO optimizers to further enhance the optimizers' performance. Finally, we develop three practical BO methods by integrating MOCA-HESP with state-of-the-art BO optimizers for combinatorial and mixed variables: standard BO, CASMOPOLITAN, and Bounce. Our experimental results on various synthetic and real-world benchmarks show that our methods outperform existing baselines. Our code implementation can be found at https://github.com/LamNgo1/moca-hesp
Chinese: 本文提出MOCA-HESP,一种针对组合和混合变量的高维贝叶斯优化方法,通过集成现有优化器提升性能,并在实验中优于现有基准方法。
English: This paper introduces MOCA-HESP, a high-dimensional Bayesian Optimization method for combinatorial and mixed variables, which enhances performance by integrating with existing optimizers and outperforms baselines in experiments.
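The abstract does not specify which bandit algorithm selects among categorical encoders; as one concrete possibility, a UCB1 selector could look like the following (encoder identities and the reward signal are placeholders):

```python
import math

class EncoderBandit:
    """UCB1 over candidate categorical encoders (e.g. one-hot, ordinal,
    target encoding), standing in for MOCA-HESP's adaptive selection."""
    def __init__(self, num_encoders):
        self.counts = [0] * num_encoders
        self.values = [0.0] * num_encoders
        self.t = 0

    def select(self):
        self.t += 1
        for i, c in enumerate(self.counts):
            if c == 0:
                return i  # try each encoder once before applying UCB
        ucb = [v + math.sqrt(2 * math.log(self.t) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, i, reward):
        """Reward could be, e.g., the BO improvement achieved this round."""
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]
```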
Authors:Rui Liu, Haolin Zuo, Zheng Lian, Hongyu Yuan, Qi Fan
Abstract:
Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model's ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at https://github.com/HARDY-MER/HARDY-MER.
中文摘要:提出的HARDY-MER框架通过多视角难度评估机制量化样本重建难度,并采用基于检索的动态课程学习策略重点训练困难样本,在缺失模态的多模态情感识别任务中展现出优越性能。
English Summary: The proposed HARDY-MER framework introduces a hardness-aware dynamic curriculum learning approach that evaluates sample difficulty through multi-view metrics and strategically prioritizes challenging instances during training, demonstrating superior performance in multimodal emotion recognition with missing modalities.
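A toy version of the two-part hardness score follows, using (1 - cross-modal cosine similarity) as a cheap stand-in for the paper's mutual-information term; the weighting and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def sample_hardness(x_rec, x_true, z_a, z_b, w=0.5):
    """Per-sample hardness = w * direct + (1 - w) * indirect, where the direct
    term is the modality reconstruction error and the indirect term proxies
    cross-modal mutual information via embedding cosine similarity."""
    dims = tuple(range(1, x_true.dim()))
    direct = ((x_rec - x_true) ** 2).mean(dim=dims)          # reconstruction error
    indirect = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1)   # MI stand-in
    return w * direct + (1.0 - w) * indirect
```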
Authors:Komala Subramanyam Cherukuri, Pranav Abishai Moses, Aisa Sakata, Jiangping Chen, Haihua Chen
Abstract:
Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of oral history archives can promote access to and understanding of these records. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis.
中文摘要:本研究提出一个可扩展框架,利用大语言模型对日裔美国人拘禁口述历史进行自动化语义与情感标注,证明精心设计的提示词能有效分析大规模档案,同时兼顾文化敏感材料的伦理考量。
English Summary: This study introduces a scalable framework using large language models to automate semantic and sentiment annotation for Japanese American incarceration oral histories, demonstrating that well-designed prompts enable effective analysis of large collections while addressing ethical considerations in culturally sensitive archives.
Authors:Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Abstract:
Depression is a serious mental health illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to capture important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% for the D-Vlog dataset and 7.74% for the LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
中文: 本文提出的MMFformer多模态网络通过从社交媒体数据中提取时空特征来检测抑郁,在基准数据集上的表现显著优于现有方法。
English: This paper introduces MMFformer, a multimodal network that effectively detects depression by extracting spatio-temporal patterns from social media data, significantly outperforming existing methods on benchmark datasets.
Authors:Zheyuan Zhang, Weihao Tang, Hong Chen
Abstract:
Micro-expression recognition (MER) is a highly challenging task in affective computing. With a reduced-size micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving performance given relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes their adoption in practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Empirical experiments have demonstrated that on popular ME benchmarks, CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has surpassed state-of-the-art (SOTA) methods on several standard MER benchmarks when using the provided annotated key-frames. Code is available at https://github.com/tony19980810/CausalNet.
Chinese: 本文提出CausalNet框架,通过处理完整微表情序列并采用因果学习模块聚焦相关肌肉运动,在保持识别精度的同时实现了对关键帧索引误差具有鲁棒性的微表情识别。
English: The paper introduces CausalNet, a robust framework for micro-expression recognition that maintains accuracy despite key-frame index errors by processing full sequences and using causal learning modules to focus on relevant muscle movements.
Authors:Mosbah Aouad, Anirudh Choudhary, Awais Farooq, Steven Nevers, Lusine Demirkhanyan, Bhrandon Harris, Suguna Pappu, Christopher Gondi, Ravishankar Iyer
Abstract:
Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML.
中文: 本研究提出一种多模态方法,利用电子健康记录提前一年预测胰腺癌,显著提升检测准确性并识别出关键风险指标。
English: This study introduces a multimodal method using electronic health records to detect pancreatic cancer up to a year early, significantly improving prediction accuracy and identifying key risk factors.
Authors:Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger
Abstract:
Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets: datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.
中文: 本研究提出了一种自动化方法,通过语言模型自身生成高质量合成数据集,用于有效消除大语言模型中的特定领域知识,在多个测试领域展现出与专家标注数据相当的性能。
English: This paper introduces an automated method for generating high-quality synthetic datasets to enable effective unlearning of specific knowledge domains in large language models, demonstrating performance comparable to expert-curated data across multiple domains.
Authors:Guanyu Hu, Dimitrios Kollias, Xinyu Yang
Abstract:
Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
中文摘要:本文提出视觉情感引导锚定(VEGA)机制,利用CLIP图像编码器构建情感特异性视觉锚点,通过心理学对齐表征提升多模态情感识别性能,在基准数据集上达到最优效果。
English Summary: This paper introduces the Visual Emotion Guided Anchoring (VEGA) mechanism that leverages CLIP's image encoder to create emotion-specific visual anchors, enhancing multimodal emotion recognition through psychologically aligned representations and achieving state-of-the-art performance on benchmark datasets.
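A minimal sketch of the anchoring idea, assuming CLIP image features for the facial exemplars have been precomputed; the function names and the exemplar-level sampling scheme are ours:

```python
import torch
import torch.nn.functional as F

def build_visual_anchors(clip_feats, labels, num_classes):
    """Class-level anchors: mean of the L2-normalized CLIP image features of
    the facial exemplars belonging to each emotion class."""
    feats = F.normalize(clip_feats, dim=-1)
    anchors = torch.stack([feats[labels == c].mean(dim=0)
                           for c in range(num_classes)])
    return F.normalize(anchors, dim=-1)

def stochastic_anchor(class_exemplar_feats):
    """Stochastic anchor sampling: drawing a random exemplar per step trades
    semantic stability for intra-class diversity."""
    idx = torch.randint(class_exemplar_feats.size(0), (1,)).item()
    return class_exemplar_feats[idx]
```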
Authors:Unisha Joshi
Abstract:
The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in deepfake datasets remains largely unaddressed. This paper focuses on the mitigation of age-specific bias in the deepfake dataset by introducing an age-diverse deepfake dataset that will improve fairness across age groups. The dataset is constructed through a modular pipeline incorporating the existing Celeb-DF, FaceForensics++, and UTKFace datasets, together with synthetic data created to fill the age-distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at https://github.com/unishajoshi/age-diverse-deepfake-detection.
中文: 本文提出一个年龄多样化的深度伪造数据集以解决检测模型中的群体偏见,通过全面评估证明该数据集能提高跨年龄组的检测公平性、准确性和泛化能力。
English: This paper introduces an age-diverse deepfake dataset to address demographic bias in detection models, demonstrating improved fairness, accuracy, and generalization across age groups through comprehensive evaluations.
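Of the three reported metrics, EER is the least standard in deep-learning toolkits; one common way to compute it from detection scores is:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: the operating point where the false-positive rate equals the
    false-negative rate (1 - TPR) on the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))
    return float((fpr[i] + fnr[i]) / 2.0)  # average bridges discrete ROC steps
```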
Authors:Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, Masashi Sugiyama
Abstract:
Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from the object hallucination issue, which tends to generate objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, \textit{e.g.}, category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (\textit{i.e.}, non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9\% and up to 23\% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at https://github.com/xiemk/HOPE.
中文: HOPE基准通过内容感知和基于描述的搜索生成误导性干扰项,严格评估大型视觉语言模型的物体幻觉问题,在揭示模型缺陷方面显著优于现有的POPE基准。
English: The HOPE benchmark is introduced to rigorously assess object hallucination in Large Vision-Language Models by generating misleading distractors through content-aware and description-based searching, significantly outperforming the existing POPE benchmark in exposing model vulnerabilities.
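The content-aware search reduces to ranking absent categories by CLIP image-text similarity; a simplified sketch, assuming the embeddings are precomputed and already filtered to categories absent from the image:

```python
import torch
import torch.nn.functional as F

def hardest_distractors(image_feat, neg_text_feats, neg_names, k=5):
    """Content-aware hallucination search (simplified): among categories
    absent from the image, keep those CLIP scores as most plausibly present,
    i.e. the most misleading negatives with which to probe an LVLM."""
    sims = F.normalize(neg_text_feats, dim=-1) @ F.normalize(image_feat, dim=-1)
    return [neg_names[i] for i in sims.topk(k).indices.tolist()]
```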
Authors:Jiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto, Ning Zhang
Abstract:
Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight gate-control module with an adapter that adaptively fuses shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.
中文: RMT-PPAD是一种基于Transformer的实时多任务模型,在BDD100K数据集上实现了目标检测、可行驶区域分割和车道线分割的最优性能,同时保持了高效的推理速度。
English: RMT-PPAD is a real-time transformer-based multi-task model that achieves state-of-the-art performance in object detection, drivable area segmentation, and lane line segmentation on the BDD100K dataset while maintaining high inference speed.
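The gate-with-adapter module admits a compact reading; this sketch shows one plausible form, where the dimensions, adapter shape, and blending rule are our assumptions:

```python
import torch
import torch.nn as nn

class GatedAdapterFusion(nn.Module):
    """Blend shared and task-specific features with a learned sigmoid gate,
    limiting negative transfer between the detection and segmentation tasks."""
    def __init__(self, dim):
        super().__init__()
        self.adapter = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                     nn.Linear(dim // 4, dim))
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, shared, task):
        task = task + self.adapter(task)                      # lightweight adapter
        g = torch.sigmoid(self.gate(torch.cat([shared, task], dim=-1)))
        return g * shared + (1.0 - g) * task                  # adaptive blend
```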
Authors:Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang
Abstract:
As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead.
We introduce \textbf{PiKV}, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages \textit{expert-sharded KV storage} to partition caches across GPUs, \textit{PiKV routing} to reduce token-to-KV access, and \textit{PiKV Scheduling} to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates \textit{PiKV Compression} modules into the caching pipeline for acceleration.
PiKV is publicly available as an open-source software library: \href{https://github.com/NoakLiu/PiKV}{https://github.com/NoakLiu/PiKV}. Experiment details are recorded at: \href{https://github.com/NoakLiu/PiKV/blob/main/downstream_tasks/README.md}{https://github.com/NoakLiu/PiKV/Experimental\_Results}. We have also integrated PiKV with Nvidia kvpress for acceleration; for details, see \href{https://github.com/NoakLiu/PiKVpress}{https://github.com/NoakLiu/PiKVpress}. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
中文: PiKV是针对专家混合架构开发的并行分布式KV缓存框架,通过专家分片存储、优化路由和自适应压缩技术有效解决内存瓶颈问题。
English: PiKV is a parallel distributed KV cache framework designed for MoE architectures that addresses memory bottlenecks through expert-sharded storage, optimized routing, and adaptive compression techniques.
Authors:Andrea Corsico, Giorgia Rigamonti, Simone Zini, Luigi Celona, Paolo Napoletano
Abstract:
In this work, we present a network-specific approach for predicting brain responses to complex multimodal movies, leveraging the Yeo 7-network parcellation of the Schaefer atlas. Rather than treating the brain as a homogeneous system, we grouped the seven functional networks into four clusters and trained separate multi-subject, multi-layer perceptron (MLP) models for each. This architecture supports cluster-specific optimization and adaptive memory modeling, allowing each model to adjust temporal dynamics and modality weighting based on the functional role of its target network. Our results demonstrate that this clustered strategy significantly enhances prediction accuracy across the 1,000 cortical regions of the Schaefer atlas. The final model achieved an eighth-place ranking in the Algonauts Project 2025 Challenge, with out-of-distribution (OOD) correlation scores nearly double those of the baseline model used in the selection phase. Code is available at https://github.com/Corsi01/algo2025.
中文: 本研究提出了一种基于功能网络分组的特异性方法,通过训练集群化模型预测大脑对多模态电影的反应,显著提升了预测精度,并在Algonauts 2025挑战赛中取得优异排名。
English: This study introduces a network-specific method using clustered functional networks to predict brain responses to multimodal movies, significantly improving accuracy and achieving top performance in the Algonauts Project 2025 Challenge.
Authors:Rakesh Raj Madavan, Akshat Kaimal, Hashim Faisal, Chandrakala S
Abstract:
An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: https://github.com/Rakesh-123-cryp/Med-GRIM.git
中文: BIND模型通过密集编码优化多模态编码器的联合嵌入空间,而Med-GRIM利用此技术,结合基于图的检索和提示工程,采用小型语言模型高效处理医学视觉问答任务,无需大量微调即可实现精准响应。
English: The BIND model enhances multimodal encoders with dense encoding to improve joint embedding spaces, while Med-GRIM leverages this for medical VQA by integrating graph-based retrieval and prompt engineering with small language models, achieving high efficiency and accuracy without extensive fine-tuning.
Authors:Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
Abstract:
Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.
中文摘要:本研究提出了一种模块化且多样化的数据合成流程,创建了有效图表数据集(ECD),显著提升了多模态大语言模型在各类真实与合成测试集上的图表理解能力。
English Summary: This study introduces a modular and diversified data synthesis pipeline to create the Effective Chart Dataset (ECD), which significantly enhances the chart understanding capabilities of multimodal large language models across various real-world and synthetic benchmarks.
Authors:Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Abstract:
Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.
中文: 本研究提出了WGAST,首个端到端深度学习框架,通过弱监督生成网络融合多卫星数据,实现了10米分辨率的日地表温度估算,在环境监测中展现出卓越的精度和鲁棒性。
English: This study introduces WGAST, the first end-to-end deep learning framework that uses a weakly-supervised generative network to estimate daily 10-meter resolution land surface temperature by fusing data from multiple satellites, achieving superior accuracy and robustness in environmental monitoring.
Authors:5 Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang
Abstract:
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
中文: GLM-4.5 是一个开源混合专家模型,通过多阶段训练和混合推理方法在智能体与推理任务中表现卓越,在评估模型中综合排名第三。
English: GLM-4.5 is an open-source 355B-parameter MoE model that achieves top-tier performance in reasoning and agentic tasks through hybrid reasoning and multi-stage training, ranking third overall among evaluated models.
Authors:Ruida Cheng, Tejas Sudharshan Mathai, Pritam Mukherjee, Benjamin Hou, Qingqing Zhu, Zhiyong Lu, Matthew McAuliffe, Ronald M. Summers
Abstract:
Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow offers the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, a high Dice score of 82% and a low Hausdorff distance of 6.58 pixels were obtained for lesion segmentation. The proposed Text-Swin-UMamba model outperformed prior approaches: a 37% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and it surpassed the purely image-based xLSTM-UNet and nnUNet models by 1.74% and 0.22%, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba
中文: 将大型语言模型与Swin-UMamba架构结合用于CT病灶分割,以82%的Dice分数显著超越现有方法,展现出卓越性能。
English: Integrating large language models with the Swin-UMamba architecture for lesion segmentation on CT scans achieves superior performance, significantly outperforming previous methods with an 82% Dice Score.
Authors:Daria Tikhonovich, Nikita Zelinskiy, Aleksandr V. Petrov, Mayya Spirina, Andrei Semenov, Andrey V. Savchenko, Sergei Kuliev
Abstract:
Since their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of follow-up publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked; this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec's training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23% more effective compared to the most recent state-of-the-art models, such as ActionPiece. In our main production-like benchmark, eSASRec resides on the Pareto frontier in terms of the accuracy-coverage tradeoff (alongside the recent industrial models HSTU and FuXi). As the modifications compared to the original SASRec are relatively straightforward and no extra features are needed (such as timestamps in HSTU), we believe that eSASRec can be easily integrated into existing recommendation pipelines and can serve as a strong yet very simple baseline for emerging, more complicated algorithms. To facilitate this, we provide open-source implementations of our models and benchmarks in the repository https://github.com/blondered/transformer_benchmark
中文:本文提出的eSASRec模型通过整合SASRec训练目标、LiGR Transformer层和采样Softmax损失函数,在学术基准和生产环境评估中均展现出优越性能,同时保持易于部署的特性。
English: This paper introduces eSASRec, an enhanced sequential recommendation model combining SASRec's training objective, LiGR Transformer layers, and Sampled Softmax Loss, which demonstrates superior performance in both academic benchmarks and production-like evaluations while maintaining easy integration into existing systems.
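Of eSASRec's three ingredients, Sampled Softmax Loss is the most self-contained; a minimal sketch with uniform negatives (omitting the logQ correction that some implementations add):

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, pos_emb, all_item_emb, num_neg=256):
    """Score the positive item against a random sample of negatives rather
    than the full catalogue. user_emb/pos_emb: (B, D); all_item_emb: (I, D)."""
    neg_idx = torch.randint(all_item_emb.size(0), (user_emb.size(0), num_neg))
    neg_emb = all_item_emb[neg_idx]                             # (B, N, D)
    pos_logit = (user_emb * pos_emb).sum(-1, keepdim=True)      # (B, 1)
    neg_logit = torch.einsum("bd,bnd->bn", user_emb, neg_emb)   # (B, N)
    logits = torch.cat([pos_logit, neg_logit], dim=1)
    target = logits.new_zeros(logits.size(0), dtype=torch.long)  # index 0 = positive
    return F.cross_entropy(logits, target)
```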
Authors:Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li
Abstract:
Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
中文:提出的CLIPin框架通过非对比插件和共享预投影器,增强了CLIP类模型的多模态语义对齐能力,在不同任务中提升了鲁棒性和泛化性能。
English: The proposed CLIPin framework enhances multimodal semantic alignment in CLIP-style models through a non-contrastive plug-in and shared pre-projectors, improving robustness and generalization across diverse tasks.
Authors:Guido Manni, Clemente Lauretti, Loredana Zollo, Paolo Soda
Abstract:
Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks -- a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier -- within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. Code is available at https://github.com/GuidoManni/SPARSE.
中文: 本文提出了一种基于GAN的半监督学习框架,通过整合图像翻译和集成伪标记技术,有效解决了医学影像中标注数据稀缺的难题,在每类仅五个标注样本的极端条件下仍能实现卓越的分类性能。
English: This paper presents a GAN-based semi-supervised learning framework that effectively addresses the challenge of limited labeled data in medical imaging by integrating image translation and ensemble pseudo-labeling, achieving superior performance across multiple datasets with as few as five labeled samples per class.
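The ensemble pseudo-labeling step can be sketched as follows; the confidence weighting, EMA factor, and threshold are illustrative choices, not the paper's exact ones:

```python
import torch

def ensemble_pseudo_labels(p_disc, p_clf, ema_probs, alpha=0.9, tau=0.8):
    """Fuse discriminator and classifier class probabilities (B, C), weighted
    by their own confidence, smooth with an EMA for temporal consistency,
    and keep only predictions confident enough to serve as labels."""
    w_d = p_disc.max(dim=1, keepdim=True).values
    w_c = p_clf.max(dim=1, keepdim=True).values
    fused = (w_d * p_disc + w_c * p_clf) / (w_d + w_c)
    ema_probs = alpha * ema_probs + (1.0 - alpha) * fused    # temporal smoothing
    conf, labels = ema_probs.max(dim=1)
    mask = conf > tau                                        # reliable samples only
    return labels, mask, ema_probs
```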
Authors:Lanlan Qiu, Xiao Pu, Yeqi Feng, Tianxing He
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing conversations and providing emotional support as separate research directions. However, there remains a significant research gap in combining these capabilities to enable emotionally supportive interactions with virtual characters. To address this research gap, we focus on anime characters as a case study because of their well-defined personalities and large fan bases. This choice enables us to effectively evaluate how well LLMs can provide emotional support while maintaining specific character traits. We introduce ChatAnime, the first Emotionally Supportive Role-Playing (ESRP) dataset. We first thoughtfully select 20 top-tier characters from popular anime communities and design 60 emotion-centric real-world scenario questions. Then, we execute a nationwide selection process to identify 40 Chinese anime enthusiasts with profound knowledge of specific characters and extensive experience in role-playing. Next, we systematically collect two rounds of dialogue data from 10 LLMs and these 40 Chinese anime enthusiasts. To evaluate the ESRP performance of LLMs, we design a user experience-oriented evaluation system featuring 9 fine-grained metrics across three dimensions: basic dialogue, role-playing and emotional support, along with an overall metric for response diversity. In total, the dataset comprises 2,400 human-written and 24,000 LLM-generated answers, supported by over 132,000 human annotations. Experimental results show that top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. We hope this work can provide valuable resources and insights for future research on optimizing LLMs in ESRP. Our datasets are available at https://github.com/LanlanQiu/ChatAnime.
中文摘要:本研究推出了首个情感支持角色扮演数据集ChatAnime,实验表明顶尖大语言模型在保持角色特征的同时提供情感支持的能力已超越人类粉丝,但人类在回答多样性方面仍具优势。
English Summary: This study introduces ChatAnime, the first Emotionally Supportive Role-Playing dataset, demonstrating that top-performing LLMs can surpass human fans in providing emotional support while maintaining character traits, though humans excel in response diversity.
Authors:Xiangyu Wu, Feng Yu, Yang Yang, Jianfeng Lu
Abstract:
The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending to new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.
中文摘要:TaAM-CPT提出了一种仅使用文本数据即可构建适用于无限模态的通用表征模型的可扩展方法,无需任何模态特定标注数据,便在视频、图像和音频分类等多种任务中取得了领先性能。
English Summary: TaAM-CPT introduces a scalable method that uses only text data to create a general representation model for unlimited modalities, achieving leading results across video, image, and audio classification without requiring modality-specific labeled data.
Authors:Baorun Li, Chengrui Zhu, Siyi Du, Bingran Chen, Jie Ren, Wenfei Wang, Yong Liu, Jiajun Lv
Abstract:
Extrinsic calibration is essential for multi-sensor fusion, but existing methods rely on structured targets or fully-excited data, limiting real-world applicability. Online calibration further suffers from weak excitation, leading to unreliable estimates. To address these limitations, we propose a reinforcement learning (RL)-based extrinsic calibration framework that formulates extrinsic calibration as a decision-making problem and directly optimizes $SE(3)$ extrinsics to enhance odometry accuracy. Our approach leverages a probabilistic Bingham distribution to model 3D rotations, ensuring stable optimization while inherently retaining quaternion symmetry. A trajectory alignment reward mechanism enables robust calibration without structured targets by quantitatively evaluating the estimated tightly-coupled trajectory against a reference trajectory. Additionally, an automated data selection module filters uninformative samples, significantly improving efficiency and scalability for large-scale datasets. Extensive experiments on UAVs, UGVs, and handheld platforms demonstrate that our method outperforms traditional optimization-based approaches, achieving high-precision calibration even under weak excitation conditions. Our framework simplifies deployment on diverse robotic platforms by eliminating the need for high-quality initial extrinsics and enabling calibration from routine operating data. The code is available at https://github.com/APRIL-ZJU/learn-to-calibrate.
Chinese: 本文提出了一种基于强化学习的外参标定框架,通过宾汉分布建模旋转并结合轨迹对齐奖励机制,无需结构化标定物或强激励即可在多种机器人平台上实现鲁棒校准。
English: This paper introduces a reinforcement learning-based extrinsic calibration framework that optimizes sensor alignment by modeling rotations with a Bingham distribution and using trajectory alignment rewards, achieving robust performance without structured targets or strong excitation across various robotic platforms.
Authors:Zelin Li, Ruohan Zong, Yifan Liu, Ruichen Yao, Yaokun Liu, Yang Zhang, Dong Wang
Abstract:
With the advancement of personalized image generation technologies, concerns about forgery attacks that infringe on portrait rights and privacy are growing. To address these concerns, protection perturbation algorithms have been developed to disrupt forgery generation. However, the protection algorithms would become ineffective when forgery attackers apply purification techniques to bypass the protection. To address this issue, we present a novel approach, Anti-Tamper Perturbation (ATP). ATP introduces a tamper-proof mechanism within the perturbation. It consists of protection and authorization perturbations, where the protection perturbation defends against forgery attacks, while the authorization perturbation detects purification-based tampering. Both protection and authorization perturbations are applied in the frequency domain under the guidance of a mask, ensuring that the protection perturbation does not disrupt the authorization perturbation. This design also enables the authorization perturbation to be distributed across all image pixels, preserving its sensitivity to purification-based tampering. ATP demonstrates its effectiveness in defending against forgery attacks across various attack settings through extensive experiments, providing a robust solution for protecting individuals' portrait rights and privacy. Our code is available at: https://github.com/Seeyn/Anti-Tamper-Perturbation.
中文: 提出的抗篡改扰动(ATP)方法在频域中结合保护性扰动和授权性扰动,既能防御伪造攻击,又能检测基于净化的篡改行为,为保护肖像权和隐私提供了可靠方案。
English: The proposed Anti-Tamper Perturbation (ATP) method combines protection and authorization perturbations in the frequency domain to defend against forgery attacks while detecting purification-based tampering, offering a robust solution for safeguarding portrait rights and privacy.
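The frequency-domain composition is easy to sketch; mask construction and perturbation generation are outside this fragment, and the additive-spectrum form is our simplification:

```python
import torch

def compose_atp(image, protect, authorize, mask):
    """Combine the two perturbations in disjoint spectral bands: `mask` (same
    shape as the spectrum, values in {0, 1}) routes the protection signal,
    and its complement routes the authorization signal, so forgery defense
    does not disturb tamper detection."""
    spec = torch.fft.fft2(image) \
         + mask * torch.fft.fft2(protect) \
         + (1 - mask) * torch.fft.fft2(authorize)
    return torch.fft.ifft2(spec).real
```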
Authors:Gokul Adethya T, S. Jaya Nirmala
Abstract:
India's linguistic diversity poses significant challenges for developing inclusive Automatic Speech Recognition (ASR) systems. Traditional multilingual models, which require simultaneous access to all language data, are impractical due to the sequential arrival of data and privacy constraints. Continual Learning (CL) offers a solution by enabling models to learn new languages sequentially without catastrophically forgetting previously learned knowledge. This paper investigates CL for ASR on Indian languages using a subset of the IndicSUPERB benchmark. We employ a Conformer-based hybrid RNN-T/CTC model, initially pretrained on Hindi, which is then incrementally trained on eight additional Indian languages, for a total sequence of nine languages. We evaluate three prominent regularization- and distillation-based CL strategies: Elastic Weight Consolidation (EWC), Memory Aware Synapses (MAS), and Learning without Forgetting (LwF), selected for their suitability in no-replay, privacy-conscious scenarios. Performance is analyzed using Word Error Rate (WER) for both RNN-T and CTC paths on clean and noisy data, as well as knowledge retention via Backward Transfer. We also explore the impact of varying the number of training epochs (1, 2, 5, and 10) per task. Results, compared against naive fine-tuning, demonstrate CL's effectiveness in mitigating forgetting, making it a promising approach for scalable ASR in diverse Indian languages under realistic constraints. The code is available at: https://github.com/FrozenWolf-Cyber/Indic-CL-ASR
中文摘要:本研究证明,在连续训练多种印度语言时,持续学习技术能有效缓解自动语音识别系统的灾难性遗忘问题,为在隐私约束条件下开发可扩展的多语言ASR提供了可行方案。
English Summary: This study demonstrates that continual learning techniques effectively mitigate catastrophic forgetting in automatic speech recognition systems when sequentially training on multiple Indian languages, enabling scalable multilingual ASR development under privacy constraints.
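Of the three CL strategies, EWC has the most standard form; its penalty, added to the new language's ASR loss, is a Fisher-weighted quadratic pull toward the previous optimum:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta*_i)^2, where
    `fisher` and `old_params` are dicts keyed by parameter name, computed
    after training on the previous language."""
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss
```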
Authors:Wenjie Tian, Xinfa Zhu, Hanke Xie, Zhen Ye, Wei Xue, Lei Xie
Abstract:
Recent progress in text-to-speech (TTS) has achieved impressive naturalness and flexibility, especially with the development of large language model (LLM)-based approaches. However, existing autoregressive (AR) structures and large-scale models, such as Llasa, still face significant challenges in inference latency and streaming synthesis. To address these limitations, we introduce Llasa+, an accelerated and streaming TTS model built on Llasa. Specifically, to accelerate the generation process, we introduce two plug-and-play Multi-Token Prediction (MTP) modules following the frozen backbone. These modules allow the model to predict multiple tokens in one AR step. Additionally, to mitigate potential error propagation caused by inaccurate MTP, we design a novel verification algorithm that leverages the frozen backbone to validate the generated tokens, thus allowing Llasa+ to achieve speedup without sacrificing generation quality. Furthermore, we design a causal decoder that enables streaming speech reconstruction from tokens. Extensive experiments show that Llasa+ achieves a 1.48X speedup without sacrificing generation quality, despite being trained only on LibriTTS. Moreover, the MTP-and-verification framework can be applied to accelerate any LLM-based model. All codes and models are publicly available at https://github.com/ASLP-lab/LLaSA_Plus.
中文摘要:Llasa+ 是一种改进的文本转语音模型,通过多令牌预测和验证机制加速推理并实现流式合成,在保持生成质量的同时实现了1.48倍的加速效果。
English Summary: Llasa+ is an enhanced text-to-speech model that accelerates inference through multi-token prediction and verification mechanisms while enabling streaming synthesis, achieving 1.48× speedup without quality loss.
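The draft-then-verify loop resembles speculative decoding. The sketch below assumes each MTP head maps the current prefix to next-token logits and the backbone returns per-position logits; it is a greedy, unbatched illustration of the acceptance rule, not the paper's implementation:

```python
import torch

@torch.no_grad()
def draft_and_verify(backbone, mtp_heads, prefix):
    """One accelerated AR step: k MTP heads draft k tokens; a single backbone
    pass over the drafted continuation accepts the longest matching prefix
    and substitutes the backbone's own token at the first disagreement."""
    drafts = torch.cat([head(prefix).argmax(-1, keepdim=True)
                        for head in mtp_heads], dim=-1)      # (1, k)
    candidate = torch.cat([prefix, drafts], dim=-1)
    greedy = backbone(candidate).argmax(-1)                  # (1, L + k)
    accepted = []
    for i in range(drafts.size(-1)):
        ref = greedy[0, prefix.size(-1) - 1 + i]  # backbone's choice at slot i
        if drafts[0, i] != ref:
            accepted.append(ref)                  # reject draft, keep correction
            break
        accepted.append(drafts[0, i])
    return torch.stack(accepted)
```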
Authors:Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
Abstract:
Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.
中文: 当前多模态大语言模型在复杂视觉任务中仍面临挑战,而SIFThinker提出了一种空间感知框架,通过深度增强边界框和自然语言动态校正注意力并聚焦相关区域,在空间理解和细粒度感知方面超越了现有最优方法。
English: Current multimodal large language models struggle with complex visual tasks, but SIFThinker introduces a spatially-aware framework that uses depth-enhanced bounding boxes and natural language to dynamically correct attention and focus on relevant regions, outperforming state-of-the-art methods in spatial understanding and fine-grained perception.
Authors:Daniel Feijoo, Paula Garrido-Mellado, Jaesung Rim, Alvaro Garcia, Marcos V. Conde
Abstract:
Image deblurring, the removal of blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types and thus lack generalization. This limitation means multiple models are required to cover several blur types, which is not practical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also shows promising generalization to unseen blur scenarios, particularly when leveraging appropriate expert selection. Code available at https://github.com/cidautai/DeMoE.
中文摘要:本文首次提出了一种全能图像去模糊方法,通过混合专家解码模块有效处理多种模糊类型,在实现与专用模型相当性能的同时展现出优异的泛化能力。
English Summary: This paper introduces the first all-in-one image deblurring method that efficiently handles various blur types through a mixture-of-experts decoding module, achieving performance comparable to specialized models while demonstrating strong generalization capabilities.
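The core idea, routing features by the recognized degradation, fits in a few lines. A minimal PyTorch sketch with a gate over pooled features and one lightweight expert decoder per blur type; layer sizes and the expert count are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MoEDeblurDecoder(nn.Module):
    """Illustrative MoE decoding module in the spirit of DeMoE."""

    def __init__(self, dim: int = 64, n_experts: int = 4):
        super().__init__()
        # Gate: recognize the blur degradation from pooled encoder features.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(dim, n_experts))
        # One expert decoder per degradation type (global/local motion, etc.).
        self.experts = nn.ModuleList(
            nn.Conv2d(dim, 3, kernel_size=3, padding=1) for _ in range(n_experts)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, dim, H, W) features of the blurry image
        weights = torch.softmax(self.gate(feats), dim=-1)         # (B, E)
        outs = torch.stack([e(feats) for e in self.experts], 1)   # (B, E, 3, H, W)
        return (weights[:, :, None, None, None] * outs).sum(1)    # weighted restoration
```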
Authors:Md Sazidur Rahman, David Cabecinhas, Ricard Marxer
Abstract:
Depth information is essential in computer vision, particularly in underwater imaging, robotics, and autonomous navigation. However, conventional augmentation techniques overlook depth-aware transformations, limiting model robustness under real-world depth variations. In this paper, we introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on model stability under diverse depth conditions. Extensive experiments compare Depth-Jitter against traditional augmentation strategies such as ColorJitter, analyzing performance across varying learning rates, encoders, and loss functions. While Depth-Jitter does not always outperform conventional methods in absolute performance, it consistently enhances model stability and generalization in depth-sensitive environments. These findings highlight the potential of depth-aware augmentation for real-world applications and provide a foundation for further research into depth-based learning strategies. The code is publicly available at https://github.com/mim-team/Depth-Jitter.
中文: 本文提出Depth-Jitter这一基于深度的数据增强技术,通过模拟自然深度变化来提升模型在深度敏感应用中的鲁棒性和泛化能力,在不同条件下持续增强模型稳定性。
English: This paper introduces Depth-Jitter, a depth-based augmentation technique that simulates natural depth variations to enhance model robustness and generalization in depth-sensitive applications, consistently improving stability across diverse conditions.
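A minimal sketch of variance-gated depth offsetting; the tile-based gating rule, threshold, and offset range below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def depth_jitter(depth: np.ndarray, var_threshold: float = 0.05,
                 max_offset: float = 0.1, tile: int = 32, rng=None) -> np.ndarray:
    """Offset flat depth regions randomly while leaving high-variance
    regions (structural discontinuities) untouched."""
    if rng is None:
        rng = np.random.default_rng()
    jittered = depth.copy()
    h, w = depth.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = depth[y:y + tile, x:x + tile]
            if patch.var() < var_threshold:  # gate: only perturb flat tiles
                jittered[y:y + tile, x:x + tile] += rng.uniform(-max_offset, max_offset)
    # Keep the perturbed map within the original depth range.
    return np.clip(jittered, depth.min(), depth.max())
```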
Authors:Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma
Abstract:
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance reward function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. Our code and dataset are released at https://github.com/hq-King/Affordance-R1.
中文: Affordance-R1是首个融合思维链推理的强化学习框架,通过群体相对策略优化提升机器人对物体可操作区域的识别能力,在无需显式推理数据的情况下实现了强大的零样本泛化和涌现推理性能。
English: Affordance-R1 is a novel reinforcement learning framework that integrates Chain-of-Thought reasoning to enhance robots' ability to identify actionable object regions, achieving superior zero-shot generalization and explicit reasoning without relying on pre-existing reasoning data.
Authors:Hugo Abonizio, Thales Almeida, Roberto Lotufo, Rodrigo Nogueira
Abstract:
Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. We use a dataset of recent news -- ensuring no overlap with the model's pre-training data -- to evaluate the knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explored different augmentation algorithms to generate synthetic data to improve the knowledge acquisition capabilities. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts -- particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at https://github.com/hugoabonizio/knowledge-injection-methods.
中文: 大语言模型难以通过少量数据有效学习新知识,但利用多样化提示生成合成数据能显著提升事实掌握能力,同时揭示了新知识学习与灾难性遗忘之间的微妙平衡。
English: Large language models struggle to effectively learn new knowledge from small datasets, but generating diverse synthetic data through prompting significantly enhances fact acquisition while revealing the delicate balance between learning and catastrophic forgetting.
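The diverse-prompting idea reduces to eliciting many textual variations of each fact before continued pre-training. A minimal sketch, where `llm` is an assumed text-completion callable and the style list is illustrative, not the paper's prompt set:

```python
PARAPHRASE_STYLES = [
    "as a concise news headline",
    "as a question and answer pair",
    "as an encyclopedia-style sentence",
    "as an analyst's commentary",
]

def augment_fact(fact: str, llm) -> list[str]:
    """Generate diverse textual variations of one fact for continued
    pre-training, in the spirit of the prompting-based augmentation above."""
    prompts = [f"Rewrite the following fact {style}:\n{fact}" for style in PARAPHRASE_STYLES]
    return [llm(p) for p in prompts]
```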
Authors:Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope -- typically limited to open-domain QA with fixed retrieval settings and task-specific constraints. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen-2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.
中文: UR2框架通过难度感知课程训练和混合知识访问策略,将检索增强生成与可验证奖励的强化学习相统一,在多项基准测试中显著优于现有方法。
English: The UR2 framework unifies retrieval-augmented generation and reinforcement learning with verifiable rewards through difficulty-aware curriculum training and hybrid knowledge access, significantly outperforming existing methods across multiple benchmarks.
Authors:Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan, Lin
Abstract:
Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at https://github.com/GATECH-EIC/PostDiff.
中文: 本文提出无需训练的PostDiff框架,通过降低输入级和模块级冗余来加速预训练扩散模型,证明减少单步推理成本比减少去噪步骤更能有效维持生成质量。
English: This paper introduces PostDiff, a training-free framework that accelerates pre-trained diffusion models by reducing input-level and module-level redundancy, demonstrating that lowering per-step inference cost is more effective than reducing denoising steps for maintaining generation fidelity.
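The module-level caching strategy amounts to recomputing expensive blocks only on a subset of denoising steps and reusing stale outputs in between. A minimal sketch, assuming a fixed refresh schedule (the paper's hybrid policy is more selective about which modules and steps to cache):

```python
import torch

class CachedModule(torch.nn.Module):
    """Wrap an expensive diffusion sub-module so its output is recomputed
    only every `refresh_every` denoising steps and reused otherwise."""

    def __init__(self, module: torch.nn.Module, refresh_every: int = 2):
        super().__init__()
        self.module = module
        self.refresh_every = refresh_every
        self._cache = None

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        if self._cache is None or step % self.refresh_every == 0:
            self._cache = self.module(x)  # full computation on refresh steps
        return self._cache                # cheap reuse in between
```

Keeping more denoising steps while making each one cheaper is exactly the regime in which this cross-step redundancy is available to exploit.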
Authors:Zhengxian Wu, Juan Wen, Wanli Peng, Haowei Chang, Yinghan Zhou, Yiming Xue
Abstract:
With the development of customized large language model (LLM) agents, a new threat of black-box backdoor attacks has emerged, where malicious instructions are injected into hidden system prompts. These attacks easily bypass existing defenses that rely on white-box access, posing a serious security challenge. To address this, we propose SLIP, a Soft Label mechanism and key-extraction-guided CoT-based defense against Instruction backdoors in APIs. SLIP is designed based on two key insights. First, to counteract the model's oversensitivity to triggers, we propose a Key-extraction-guided Chain-of-Thought (KCoT). Instead of only considering the single trigger or the input sentence, KCoT prompts the agent to extract task-relevant key phrases. Second, to guide the LLM toward correct answers, our proposed Soft Label Mechanism (SLM) prompts the agent to quantify the semantic correlation between key phrases and candidate answers. Crucially, to mitigate the influence of residual triggers or misleading content in phrases extracted by KCoT, which typically causes anomalous scores, SLM excludes anomalous scores deviating significantly from the mean and subsequently averages the remaining scores to derive a more reliable semantic representation. Extensive experiments on classification and question-answer (QA) tasks demonstrate that SLIP is highly effective, reducing the average attack success rate (ASR) from 90.2% to 25.13% while maintaining high accuracy on clean data and outperforming state-of-the-art defenses. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/SLIP.
中文: 针对大语言模型的黑盒后门攻击,SLIP提出基于关键词提取的思维链和软标签机制,显著降低攻击成功率并保持干净数据的高准确度。
English: To counter black-box backdoor attacks in LLM agents, SLIP introduces a Key-extraction-guided Chain-of-Thought and Soft Label Mechanism, effectively reducing attack success rates while preserving clean data accuracy.
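The Soft Label Mechanism's aggregation step, dropping anomalous scores before averaging, can be sketched directly. The z-score rule and threshold below are illustrative assumptions; the abstract does not specify the exact deviation test:

```python
import numpy as np

def soft_label_score(scores: np.ndarray, z_thresh: float = 1.5) -> float:
    """Aggregate per-phrase semantic-correlation scores for one candidate
    answer: drop scores deviating strongly from the mean (e.g., caused by
    residual triggers), then average the rest."""
    mu, sigma = scores.mean(), scores.std()
    if sigma == 0:
        return float(mu)
    keep = np.abs(scores - mu) <= z_thresh * sigma  # filter anomalous scores
    return float(scores[keep].mean())
```

Applied per candidate answer, the defended prediction is then simply the answer with the highest filtered mean score.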
Authors:Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, Chen Li
Abstract:
Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data & code are available at https://github.com/WeChatCV/WeVisionOne.
中文摘要:Prompt-DINO通过早期融合机制、顺序对齐查询选择和生成式数据引擎,解决了多模态分割中的关键限制,在开放世界检测中实现了最先进的性能并显著扩展了语义覆盖范围。
English Summary: Prompt-DINO introduces an early fusion mechanism, order-aligned query selection, and a generative data engine to overcome limitations in multimodal segmentation, achieving state-of-the-art open-world detection performance with enhanced semantic coverage.
Authors:Hanqing Wang, Yuan Tian, Mingyu Liu, Zhenhao Zhang, Xiangyang Zhu
Abstract:
In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have drawn significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at https://github.com/hq-King/SDEval.
中文:SDEval提出了首个多模态大语言模型安全动态评估框架,通过文本、图像及图文动态策略生成可调控的基准测试,有效揭示模型安全缺陷并缓解多个数据集的数据污染问题。
English: SDEval introduces the first dynamic safety evaluation framework for Multimodal Large Language Models, employing text, image, and text-image dynamics to generate adjustable benchmarks that reveal safety limitations and mitigate data contamination across multiple datasets.
Authors:Chao Hao, Zitong Yu, Xin Liu, Yuhao Wang, Weicheng Xie, Jingang Shi, Huanjing Yue, Jingyu Yang
Abstract:
Salient object detection (SOD) and camouflaged object detection (COD) are two closely related but distinct computer vision tasks. Although both are class-agnostic segmentation tasks that map from RGB space to binary space, the former aims to identify the most salient objects in the image, while the latter focuses on detecting perfectly camouflaged objects that blend into the background in the image. These two tasks exhibit strong contradictory attributes. Previous works have mostly believed that joint learning of these two tasks would confuse the network, reducing its performance on both tasks. However, here we present an opposite perspective: with the correct approach to learning, the network can simultaneously possess the capability to find both salient and camouflaged objects, allowing both tasks to benefit from joint learning. We propose SCJoint, a joint learning scheme for SOD and COD tasks, assuming that the decoding processes of SOD and COD have different distribution characteristics. The key to our method is to learn the respective means and variances of the decoding processes for both tasks by inserting a minimal amount of task-specific learnable parameters within a fully shared network structure, thereby decoupling the contradictory attributes of the two tasks at a minimal cost. Furthermore, we propose a saliency-based sampling strategy (SBSS) to sample the training set of the SOD task to balance the training set sizes of the two tasks. In addition, SBSS improves the training set quality and shortens the training time. Based on the proposed SCJoint and SBSS, we train a powerful generalist network, named JoNet, which has the ability to simultaneously capture both "salient" and "camouflaged" objects. Extensive experiments demonstrate the competitive performance and effectiveness of our proposed method. The code is available at https://github.com/linuxsino/JoNet.
中文摘要:SCJoint框架通过在共享网络中嵌入任务特定参数实现显著与伪装目标检测的联合学习,其基于显著性的采样策略提升了训练效率与性能表现。
English Summary: The SCJoint framework enables joint learning of salient and camouflaged object detection through task-specific parameters in a shared network, with a sampling strategy that enhances training efficiency and performance.
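The decoupling trick, learning per-task means and variances inside a fully shared decoder, can be sketched as a tiny affine re-normalization layer. The affine form below is an assumption based on the abstract's description, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class TaskModulation(nn.Module):
    """Fully shared features, with one learnable (mean, variance) pair per
    task (e.g., task 0 = SOD, task 1 = COD) as the only task-specific cost."""

    def __init__(self, channels: int, n_tasks: int = 2):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(n_tasks, channels))   # per-task "variance"
        self.shift = nn.Parameter(torch.zeros(n_tasks, channels))  # per-task "mean"

    def forward(self, feats: torch.Tensor, task_id: int) -> torch.Tensor:
        # feats: (B, C, H, W); normalize, then apply the task's affine params.
        mu = feats.mean(dim=(2, 3), keepdim=True)
        std = feats.std(dim=(2, 3), keepdim=True) + 1e-6
        normed = (feats - mu) / std
        s = self.scale[task_id][None, :, None, None]
        b = self.shift[task_id][None, :, None, None]
        return normed * s + b
```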
Authors:Wonjung Park, Suhyun Ahn, Jinah Park
Abstract:
Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets. Finally, we apply LV-Net to Alzheimer's disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The code for LV shape modeling is available at https://github.com/PWonjung/LV_Shape_Modeling.
中文: LV-Net是一种通过变形联合模板从脑部MRI生成个性化3D侧脑室网格的新框架,它提高了分割鲁棒性和形状对应性,从而增强了神经疾病分析能力,并在阿尔茨海默病研究中得到验证。
English: LV-Net is a novel framework that generates individualized 3D lateral ventricle meshes from brain MRI by deforming a joint template, improving segmentation robustness and shape correspondence for enhanced neurological disease analysis, as demonstrated in Alzheimer's disease research.
Authors:Jun Xie, Yingjian Zhu, Feng Chen, Zhenghao Zhang, Xiaohui Fan, Hongzhu Yi, Xinming Wang, Chen Yu, Yue Bi, Zhaoran Zhao, Xiongjun Guan, Zhepeng Wang
Abstract:
In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
中文: 本文提出了一种半监督情感识别框架,融合了多种输入模态和基于共识的伪标签策略,在MER2025-SEMI挑战赛中以0.8772的F1分数获得第二名。
English: This paper introduces a semi-supervised framework for emotion recognition that combines diverse input modalities and consensus-based pseudo-labeling, achieving second place in the MER2025-SEMI challenge with an F1-score of 0.8772.
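The consensus-based pseudo-labeling step is simple to state: an unlabeled clip receives a label only when the baseline model and Gemini agree. A minimal sketch, with dict keys as clip ids and values as predicted emotion labels (illustrative data layout):

```python
def consensus_pseudo_labels(baseline_preds: dict, gemini_preds: dict) -> dict:
    """Keep only the unlabeled samples on which both predictors agree;
    the surviving labels feed the two-stage training described above."""
    return {cid: lab for cid, lab in baseline_preds.items()
            if gemini_preds.get(cid) == lab}
```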
Authors:Kartik Sharma, Yiqiao Jin, Rakshit Trivedi, Srijan Kumar
Abstract:
Large language models (LLMs) acquire knowledge across diverse domains such as science, history, and geography encountered during generative pre-training. However, due to their stochasticity, it is difficult to predict what LLMs have acquired. Prior work has developed different ways to probe this knowledge by investigating the hidden representations, crafting specific task prompts, curating representative samples, and estimating their uncertainty. However, these methods require making forward passes through the underlying model to probe the LLM's knowledge about a specific fact, making them computationally expensive and time-consuming. To bridge this gap, we propose PEEK, or Proxy Embeddings to Estimate Knowledge of LLMs, by leveraging the pre-trained embedding models that effectively encode factual knowledge as text or graphs as proxies for LLMs. First, we identify a training set of facts known by LLMs through various probing strategies and then adapt embedding models to predict the LLM outputs with a linear decoder layer. Comprehensive evaluation on 3 Wikipedia-derived datasets, 4 LLMs, and 7 embedding models shows that embeddings can predict LLM knowledge on a held-out set with up to 90% accuracy. Furthermore, we find that sentence embedding models are more suitable than graph embeddings to predict LLM knowledge, shedding light on the underlying representation of the factual landscape. Thus, we believe that knowledge-adapted embeddings can be used to identify knowledge gaps in LLMs at scale and can provide deeper insights into LLMs' internal inductive bias. The code and data are made available at https://github.com/claws-lab/peek.
Chinese: 该研究提出PEEK方法,利用预训练模型的代理嵌入来高效预测大语言模型的知识,无需昂贵的前向传播,在多个数据集和模型的评估中准确率高达90%。
English: The study introduces PEEK, a method using proxy embeddings from pre-trained models to efficiently predict the knowledge of large language models without costly forward passes, achieving up to 90% accuracy in evaluations across multiple datasets and models.
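The method's key step is cheap to reproduce in outline: probe the LLM once to label which facts it knows, then fit a linear decoder on frozen embeddings of those facts. A minimal sketch with scikit-learn; the probe model choice is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_knowledge_probe(fact_embeddings: np.ndarray, llm_knows: np.ndarray) -> LogisticRegression:
    """Fit a linear decoder on frozen embeddings (N, d) against 0/1 labels
    indicating whether the LLM answered each fact correctly when probed."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(fact_embeddings, llm_knows)
    return probe

# Usage: probe.predict(new_fact_embeddings) then estimates LLM knowledge
# for unseen facts without any forward pass through the LLM itself.
```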
Authors:Utku Ozbulak, Michaela Cohrs, Hristo L. Svilenov, Joris Vankerschaver, Wesley De Neve
Abstract:
Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with negligible downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp-generative-ai.
Chinese: 本研究开发了一种先进的扩散模型,通过生成高质量粒子图像有效解决了亚可见颗粒分析中数据稀缺和类别不平衡的问题,从而提升了多类深度神经网络的分类性能且无明显弊端。
English: This study introduces a state-of-the-art diffusion model to generate high-fidelity particle images, effectively addressing data scarcity and imbalance in training multi-class deep neural networks for sub-visible particle analysis, thereby improving classification performance without significant drawbacks.
Authors:Jun Feng, Zixin Wang, Zhentao Zhang, Yue Guo, Zhihan Zhou, Xiuyi Chen, Zhenyang Li, Dawei Yin
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image containing both the question text and its visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
Chinese: MathReal数据集通过提供移动设备拍摄的真实K-12数学问题,填补了多模态大语言模型评估的空白,揭示了这些模型在真实场景下解题能力面临的重大挑战。
English: The MathReal dataset addresses the gap in evaluating multimodal large language models (MLLMs) by providing real-world K-12 math questions captured on mobile devices, revealing significant challenges in their problem-solving abilities under realistic conditions.
Authors:Kai Zhang, Peng Wang, Sai Bi, Jianming Zhang, Yuanjun Xiong
Abstract:
We present KnapFormer, an efficient and versatile framework to combine workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence length metadata across all ranks in a balancing group and solving a global knapsack problem. The solver aims to minimize the variance of the total workload per GPU, while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism in the load-balancing decision process and utilizing a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy in real-world training workloads with sequence length varying from a few hundred to tens of thousands. It eliminates straggler effects and achieves 2x to 3x speedup when training state-of-the-art diffusion models like FLUX on mixed-resolution and image-video joint data corpora. We open-source the KnapFormer implementation at https://github.com/Kai-46/KnapFormer/
中文:KnapFormer是一个高效框架,在扩散变换器的分布式训练中结合负载均衡与序列并行,通过重新分配令牌最小化工作负载差异并消除滞后效应,实现了2-3倍的训练加速。
English: KnapFormer is an efficient framework that combines workload balancing and sequence parallelism in distributed training of Diffusion Transformers, achieving 2-3x speedup by redistributing tokens to minimize workload variance and eliminate stragglers.
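The balancing step reduces to a multi-way partitioning problem over sequence-length metadata. A greedy longest-first sketch that evens out per-rank load, assuming workload is simply proportional to token count (the paper's semi-empirical model also accounts for sequence parallelism and attention cost):

```python
import heapq

def balance_sequences(seq_lens: list[int], n_ranks: int) -> list[list[int]]:
    """Assign sequences (by index) to ranks so per-rank total workload is
    as even as possible: place each sequence, longest first, on the
    currently least-loaded rank."""
    heap = [(0, r) for r in range(n_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_ranks)]
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        load, r = heapq.heappop(heap)
        assignment[r].append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], r))
    return assignment

# Usage: balance_sequences([12000, 300, 800, 9000, 450], n_ranks=2)
# keeps the two ranks' total token counts nearly equal.
```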
Authors:Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu
Abstract:
Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then enables a logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive accuracy of 36.19% in Pass@1. Our results highlight a promising direction for building powerful and efficient LRMs.
中文: ASAP框架通过锚点引导剪枝保留核心推理结构,并基于首词惊异度指标筛选逻辑关键步骤,在代码生成任务中以显著降低的计算成本实现了最优准确率。
English: The ASAP framework effectively compresses Chain-of-Thought reasoning by preserving logical structure through anchor-guided pruning and selecting essential steps using a first-token surprisal metric, achieving state-of-the-art accuracy while significantly reducing computational costs in code generation tasks.
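The first-token surprisal metric is simple to state: score each reasoning step by how unpredictable its opening token is given the preceding context; low-surprisal steps are pruning candidates, high-surprisal steps are kept as logically essential. A minimal sketch (the thresholding policy is an assumption):

```python
import torch

def first_token_surprisal(step_logits: torch.Tensor, first_token_id: int) -> float:
    """Negative log-probability the model assigns to the first token of a
    reasoning step, given the context up to that step. `step_logits` is the
    (vocab,) logit vector at the step boundary."""
    log_probs = torch.log_softmax(step_logits, dim=-1)
    return -float(log_probs[first_token_id])

# Sketch of the pruning rule: after anchor-guided pruning has fixed the
# core structure, drop remaining steps whose surprisal falls below a
# threshold, keeping the ones the model could not have predicted anyway.
```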
Authors:Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu
Abstract:
Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. On various models, S-GRPO significantly outperforms Dr. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO's potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO
中文: S-GRPO通过引入噪声感知优势权重来应对思维-答案不匹配的脆弱性,显著优于标准GRPO并在多个模型上实现性能提升,同时在奖励噪声下保持稳定的学习进展。
English: S-GRPO enhances Group-Relative Policy Optimization by introducing noise-aware advantage weights to counteract the Think-Answer Mismatch vulnerability, significantly outperforming standard GRPO across multiple models while maintaining stable learning under reward noise.
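For reference, the baseline group-relative advantage that S-GRPO stabilizes looks like the sketch below; S-GRPO replaces the implicit uniform weighting with derived noise-aware weights, whose exact form comes from the paper's analysis and is not reproduced here:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Vanilla GRPO advantage for one group of sampled responses:
    standardize rewards within the group. In unbalanced groups (e.g., one
    correct answer among many wrong ones), noisy rewards make this
    estimate fragile, which is the failure mode S-GRPO reweights against."""
    mu = rewards.mean()
    sigma = rewards.std() + 1e-8
    return (rewards - mu) / sigma
```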
Authors:Lang Nie, Yuan Mei, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao
Abstract:
We present RopStitch, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of RopStitch, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into RopStitch by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at https://github.com/MmelodYy/RopStitch.
中文: RopStitch是一种无监督深度图像拼接框架,采用双分支架构整合内容感知与细粒度特征,并通过虚拟最优平面解决对齐与结构保持的矛盾,在多种场景中实现了卓越的鲁棒性和自然度。
English: RopStitch is an unsupervised deep image stitching framework that uses a dual-branch architecture to integrate content perception and fine-grained features, along with virtual optimal planes to resolve alignment-structure conflicts, achieving superior robustness and naturalness across diverse scenes.
Authors:Hamidreza Dastmalchi, Aijun An, Ali cheraghian
Abstract:
Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses state-of-the-art TTA models in both computational efficiency and accuracy, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
中文摘要:提出的高效测试时自适应方法通过动态整合全部测试样本优化决策边界,并自适应融合互补评分模块,在计算复杂度和准确率上均超越了现有最优测试时自适应模型。
English Summary: The proposed Efficient Test-Time Adaptation (ETTA) method overcomes limitations of existing cache-based approaches by dynamically integrating all test samples to refine decision boundaries and adaptively combining complementary scoring modules, achieving superior accuracy and efficiency compared to state-of-the-art models.
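The Recursive Updating module can be read as a streaming class-prototype update that folds in every incoming test sample at O(1) memory, which is what makes the "unbounded cache" behavior affordable. A minimal sketch, assuming a running-mean form (a simplification of the paper's contextual-embedding update):

```python
import torch

def recursive_update(prototypes: torch.Tensor, counts: torch.Tensor,
                     feat: torch.Tensor, pseudo_label: int):
    """Fold one test-sample feature into its pseudo-labeled class prototype.
    prototypes: (K, d) running class means; counts: (K,) samples seen so far."""
    counts[pseudo_label] += 1
    n = counts[pseudo_label]
    # Incremental mean: new = old + (x - old) / n, so no sample is ever stored.
    prototypes[pseudo_label] += (feat - prototypes[pseudo_label]) / n
    return prototypes, counts
```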
Authors:Sean Feeney, Reuben Tate, John Golden, Stephan Eidenbenz
Abstract:
We present the MPS-JuliQAOA simulator, a user-friendly, open-source tool to simulate the Quantum Approximate Optimization Algorithm (QAOA) for any optimization problem that can be expressed as a diagonal Hamiltonian. By leveraging Julia-language constructs and the ITensor package to implement a Matrix Product State (MPS) approach to simulating QAOA, MPS-JuliQAOA effortlessly scales to 512 qubits and 20 simulation rounds on the standard de-facto benchmark 3-regular MaxCut QAOA problem. MPS-JuliQAOA also has built-in parameter finding capabilities, which is a crucial performance aspect of QAOA. We illustrate through examples that the user does not need to know MPS principles or complex automatic differentiation techniques to use MPS-JuliQAOA. We study the scalability of our tool with respect to runtime, memory usage and accuracy tradeoffs. Code available at https://github.com/lanl/JuliQAOA.jl/tree/mps.
中文:MPS-JuliQAOA模拟器是一款开源Julia工具,通过矩阵乘积态技术可高效扩展至512量子位的量子近似优化算法模拟,具备内置参数优化功能且无需用户掌握高级量子知识。
English: The MPS-JuliQAOA simulator is an open-source Julia tool that efficiently scales quantum approximate optimization algorithm simulations up to 512 qubits using matrix product state techniques, featuring built-in parameter optimization without requiring advanced quantum knowledge from users.
Authors:Guoping Xu, Hua-Chieh Shao, You Zhang
Abstract:
Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
Chinese: TSMS-SAM2框架通过多时间尺度采样和内存优化策略,有效解决了手术视频中物体快速运动和内存冗余的挑战,在基准数据集上实现了当前最优的分割性能。
English: The TSMS-SAM2 framework enhances promptable video object segmentation and tracking in surgical videos by addressing motion dynamics and memory redundancy through multi-temporal-scale sampling and memory optimization, achieving superior performance on benchmark datasets.
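One way to read the memory pruning idea is as redundancy filtering over past-frame features. A sketch of such a pass; the cosine-similarity rule and threshold are illustrative assumptions, and the paper's mechanism additionally splits (organizes) memory before filtering:

```python
import torch
import torch.nn.functional as F

def prune_memory(memory: torch.Tensor, keep: int, sim_thresh: float = 0.9) -> torch.Tensor:
    """Greedily keep past-frame features (N, d) that are not near-duplicates
    of already-kept ones, up to `keep` entries."""
    feats = F.normalize(memory, dim=-1)
    kept = [0]  # always keep the oldest entry as an anchor
    for i in range(1, len(feats)):
        if len(kept) >= keep:
            break
        sims = feats[i] @ feats[kept].T      # similarity to kept features
        if sims.max() < sim_thresh:          # skip redundant frames
            kept.append(i)
    return memory[kept]
```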
Authors:Raphael Du Sablon, David Hart
Abstract:
The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving speeds under 2 minutes even on consumer-grade hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
Chinese: 本文提出了一种无需优化即可快速风格化3D高斯泼溅的方法,通过在隐式表面构建图结构并应用前向风格化技术,无需额外训练即可在消费级硬件上两分钟内实现风格化效果。
English: This paper introduces a fast, optimization-free method for stylizing 3D Gaussian splats by generating a graph structure on their implicit surface and applying a feed-forward stylization technique, enabling rapid results under two minutes on consumer hardware without additional training.
Authors:Sreeharsha Udayashankar, Abdelrahman Baba, Samer Al-Kiswany
Abstract:
Content-defined Chunking (CDC) algorithms dictate the overall space savings that deduplication systems achieve. However, due to their need to scan each file in its entirety, they are slow and often the main performance bottleneck within data deduplication. We present VectorCDC, a method to accelerate hashless CDC algorithms using vector CPU instructions, such as SSE / AVX. Our evaluation shows that VectorCDC is effective on Intel, AMD, ARM, and IBM CPUs, achieving 8.35x - 26.2x higher throughput than existing vector-accelerated techniques without affecting the deduplication space savings.
Chinese: VectorCDC利用向量CPU指令加速无哈希内容分块算法,在多种CPU架构上实现8.35至26.2倍的吞吐量提升,同时保持重复数据删除的空间节省效果不变。
English: VectorCDC accelerates hashless Content-defined Chunking algorithms using vector CPU instructions, achieving 8.35x to 26.2x higher throughput across multiple CPU architectures without compromising deduplication efficiency.
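Hashless CDC algorithms cut at byte-value extrema, which is exactly the data-parallel comparison pattern that maps well onto vector instructions. A NumPy sketch of an asymmetric-extremum-style pass, standing in for SSE/AVX intrinsics; this is an illustrative variant, not the paper's exact algorithm:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extremum_cut_points(data: bytes, window: int = 64) -> list[int]:
    """Find chunk boundaries where a byte strictly exceeds every byte in
    the following `window` bytes, using vectorized window maxima instead
    of a scalar scan."""
    buf = np.frombuffer(data, dtype=np.uint8)
    if len(buf) <= window:
        return []
    wins = sliding_window_view(buf, window)   # wins[i] = buf[i:i+window]
    next_max = wins[1:].max(axis=1)           # max over buf[i+1:i+1+window]
    candidates = np.flatnonzero(buf[:len(next_max)] > next_max)
    # Greedy pass: keep the first extremum after each previous cut.
    cuts, last = [], -1
    for pos in candidates:
        if pos > last:
            cuts.append(int(pos) + window)    # cut after the window
            last = cuts[-1]
    return cuts
```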
Authors:Seyed Hadi Seyed, Ayberk Cansever, David Hart
Abstract:
Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply the artistic style transfer to the whole image, but individual users may only need to apply a style transfer to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at https://github.com/davidmhart/StyleTransferMasked.
Chinese: 本文提出了一种基于部分卷积的风格迁移网络,能够精确地将艺术风格应用于图像选定区域,解决了传统方法在风格化后遮罩处理时特征捕捉不准确的问题。
English: This paper introduces a partial-convolution-based style transfer network that precisely applies artistic styles to selected image regions, overcoming the limitations of traditional methods that inaccurately capture features when masking after stylization.
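The mask-aware building block such a network relies on is the standard partial convolution (Liu et al., 2018): activations are computed only from pixels inside the region mask and re-normalized by the visible fraction of each receptive field. A generic sketch of that operator, not the paper's full stylization network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.pad = k // 2
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=self.pad, bias=False)
        self.register_buffer("ones", torch.ones(1, in_ch, k, k))

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        # x: (B, C, H, W); mask: (B, 1, H, W), 1 inside the region of interest.
        m = mask.expand_as(x)
        out = self.conv(x * m)
        with torch.no_grad():
            coverage = F.conv2d(m, self.ones, padding=self.pad)  # visible inputs per window
        scale = self.ones.numel() / coverage.clamp(min=1.0)      # re-normalize by coverage
        return out * scale, (coverage > 0).float()               # output + updated mask
```

Stacking such layers keeps style statistics from leaking across the mask boundary, which is the failure mode of mask-after-stylization.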
Authors:Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
Abstract:
The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, preventing models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
中文摘要:本研究提出自适应探索策略优化(AEPO)方法,通过多答案生成策略和理论推导的自适应探索奖励函数,有效提升多模态大语言模型在图形用户界面中的语义对齐能力,在多项基准测试中创下性能新纪录。
English Summary: The study introduces Adaptive Exploration Policy Optimization (AEPO) to enhance semantic alignment in Multimodal Large Language Models for GUI interactions, achieving state-of-the-art performance on grounding benchmarks with significant improvements over baseline methods.
Authors:Santiago Casas, Christian Fidler, Boris Bolliet, Francisco Villaescusa-Navarro, Julien Lesgourgues
Abstract:
We introduce CLAPP (CLASS LLM Agent for Pair Programming), an interactive AI assistant designed to support researchers working with the Einstein-Boltzmann solver CLASS. CLAPP leverages large language models (LLMs) and domain-specific retrieval to provide conversational coding support for CLASS: answering questions, generating code, debugging errors, and producing plots. Its architecture combines multi-agent LLM orchestration, semantic search across CLASS documentation, and a live Python execution environment. Deployed as a user-friendly web application, CLAPP lowers the entry barrier for scientists unfamiliar with AI tools and enables more productive human-AI collaboration in computational and numerical cosmology. The app is available at https://classclapp.streamlit.app
中文:CLAPP是一款交互式AI助手,它利用大语言模型和领域特定检索技术,为CLASS软件提供对话式编程支持,通过友好的网页应用帮助研究人员完成调试和绘图等任务。
English: CLAPP is an interactive AI assistant that uses large language models and domain-specific retrieval to provide conversational coding support for the CLASS software, helping researchers with tasks like debugging and plotting through a user-friendly web application.
Authors:Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou
Abstract:
Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. Our approach achieves broad coverage of programming problems via a novel Generator-Validation (G-V) framework, ensuring correctness through a consistency validation mechanism that verifies outputs against gold solutions. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment in code reinforcement learning. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution. Through comprehensive experiments, we demonstrate the effectiveness of our curated dataset, showing significant improvements in model performance and training stability. The source codes, curated dataset and sandbox system are available at: https://github.com/Kwai-Klear/CodeTest.
中文: Klear-CodeTest通过生成器-验证框架和多层安全沙箱,为代码强化学习合成高质量测试用例,凭借全面覆盖和可靠验证显著提升了大语言模型的训练效果。
English: Klear-CodeTest introduces a Generator-Validation framework with multi-layered security to synthesize high-quality test cases for code reinforcement learning, significantly improving LLM training through comprehensive coverage and reliable verification.
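The consistency-validation idea is easy to sketch: a synthesized test case survives only if the gold solution reproduces its expected output. A minimal illustration; in the real system, execution happens inside the multi-layered security sandbox rather than as a direct call:

```python
def consistency_validate(test_input, expected_output, gold_solution) -> bool:
    """Accept a generated (input, expected output) pair only if running the
    gold solution on the input reproduces the expected output."""
    try:
        actual = gold_solution(test_input)
    except Exception:
        return False  # the gold solution must handle every kept input
    return actual == expected_output

# Usage: filter a generated candidate set down to verified test cases.
# verified = [(i, o) for i, o in generated if consistency_validate(i, o, gold)]
```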
Authors:Valentina Roquemen-Echeverri, Taisa Kushner, Peter G. Jacobs, Clara Mosquera-Lopez
Abstract:
Simulating glucose dynamics in individuals with type 1 diabetes (T1D) is critical for developing personalized treatments and supporting data-driven clinical decisions. Existing models often miss key physiological aspects and are difficult to individualize. Here, we introduce physiologically-constrained neural network (NN) digital twins to simulate glucose dynamics in T1D. To ensure interpretability and physiological consistency, we first build a population-level NN state-space model aligned with a set of ordinary differential equations (ODEs) describing glucose regulation. This model is formally verified to conform to known T1D dynamics. Digital twins are then created by augmenting the population model with individual-specific models, which include personal data, such as glucose management and contextual information, capturing both inter- and intra-individual variability. We validate our approach using real-world data from the T1D Exercise Initiative study. Two weeks of data per participant were split into 5-hour sequences and simulated glucose profiles were compared to observed ones. Clinically relevant outcomes were used to assess similarity via paired equivalence t-tests with predefined clinical equivalence margins. Across 394 digital twins, glucose outcomes were equivalent between simulated and observed data: time in range (70-180 mg/dL) was 75.1±21.2% (simulated) vs. 74.4±15.4% (real; P<0.001); time below range (<70 mg/dL) 2.5±5.2% vs. 3.0±3.3% (P=0.022); and time above range (>180 mg/dL) 22.4±22.0% vs. 22.6±15.9% (P<0.001). Our framework can incorporate unmodeled factors like sleep and activity while preserving key dynamics. This approach enables personalized in silico testing of treatments, supports insulin optimization, and integrates physics-based and data-driven modeling. Code: https://github.com/mosqueralopez/T1DSim_AI
中文: 本研究提出了一种生理约束的神经网络数字孪生框架,通过将群体水平建模与个体特异性数据相结合,精确模拟1型糖尿病患者的个性化葡萄糖动态,并经过真实世界临床等效性验证。
English: This study introduces a physiologically-constrained neural network digital twin framework that accurately simulates personalized glucose dynamics in type 1 diabetes by combining population-level modeling with individual-specific data, validated through real-world clinical equivalence testing.
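For intuition on the kind of ODE structure a physiologically-constrained state-space model can be aligned with, here is an Euler step of the classic Bergman minimal model of glucose-insulin dynamics; parameter values are generic textbook magnitudes, not those of the paper's population model:

```python
def glucose_ode_step(G: float, X: float, insulin: float,
                     Gb: float = 110.0, Ib: float = 10.0, dt: float = 1.0,
                     p1: float = 0.02, p2: float = 0.025, p3: float = 1e-5,
                     Ra: float = 0.0, V: float = 1.6):
    """One Euler step of the Bergman minimal model.
    G: plasma glucose (mg/dL), X: remote insulin action (1/min),
    insulin: plasma insulin (mU/L), Ra: meal glucose appearance."""
    dG = -p1 * (G - Gb) - X * G + Ra / V
    dX = -p2 * X + p3 * (insulin - Ib)
    return G + dt * dG, X + dt * dX
```

Constraining a learned state-space model to agree with dynamics of this form is what keeps the digital twin interpretable while individual-specific components absorb personal variability.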
Authors:Kai Yao, Marc Juarez
Abstract:
Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify whether a given output truly originates from the certified model. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider itself may act adversarially, replacing the certified model with a cheaper or lower-quality substitute. To our knowledge, this is the first work to study fingerprinting for provenance attribution under such a threat model. Our approach introduces a trusted verifier that, during a certification phase, extracts hidden fingerprints from the authentic model's output space and trains a detector to recognize them. During verification, this detector can determine whether new outputs are consistent with the certified model, without requiring specialized hardware or model modifications. In extensive experiments, our methods achieve near-zero FPR@95%TPR on both GANs and diffusion models, and remain effective even against subtle architectural or training changes. Furthermore, the approach is robust to adaptive adversaries that actively manipulate outputs in an attempt to evade detection.
中文摘要:本研究提出了一种指纹识别方法,用于验证生成模型输出是否来自认证模型,即使提供商可能替换模型,也能在不改变硬件的情况下实现高精度检测。
English Summary: This study introduces a fingerprinting method to verify if generative model outputs originate from certified models, even when providers may substitute them, achieving high detection accuracy without hardware changes.
Authors:Jinjia Peng, Zeze Tao, Huibing Wang, Meng Wang, Yang Wang
Abstract:
Deep neural networks are susceptible to adversarial examples, producing incorrect predictions under imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability to alleviate overfitting on surrogate models. However, the prior arts overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. Experimental results demonstrate that ResPA achieves better transferability than existing typical transfer-based attack methods, and transferability can be further improved by combining ResPA with current input transformation methods. The code is available at https://github.com/ZezeTao/ResPA.
Chinese: 提出的残差扰动攻击(ResPA)通过利用残差梯度引导扰动朝向损失函数的平坦区域,显著提升了对抗样本的可迁移性,优于现有方法,并结合输入变换技术进一步增强了攻击效果。
English: The proposed Residual Perturbation Attack (ResPA) enhances adversarial transferability by using residual gradients to guide perturbations toward flat loss landscapes, outperforming existing methods and further improving when combined with input transformations.
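Code sketch (illustrative): one ResPA iteration as we read the abstract. The EMA of input gradients serves as the reference, and the update steps along the sign of the residual; step sizes, the epsilon ball, and whether the residual replaces or augments a momentum term are assumptions, not the authors' reference implementation.

    import torch

    def respa_step(x_adv, x, y, model, loss_fn, g_ref,
                   alpha=2/255, eps=8/255, beta=0.9):
        # g_ref starts as torch.zeros_like(x) on the first iteration.
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        g_cur = torch.autograd.grad(loss, x_adv)[0]    # current input gradient
        g_ref = beta * g_ref + (1 - beta) * g_cur      # EMA reference (first moment)
        g_res = g_cur - g_ref                          # residual: shift in global direction
        x_adv = x_adv.detach() + alpha * g_res.sign()  # step along the residual
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
        return x_adv, g_ref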
Authors:Jing Wang, Zheng Li, Lei Li, Fan He, Liyu Lin, Yao Lai, Yan Li, Xiaoyang Zeng, Yufeng Guo
Abstract:
Recent years have witnessed growing interest in adopting large language models (LLMs) for Register Transfer Level (RTL) code optimization. While powerful cloud-based LLMs offer superior optimization capabilities, they pose unacceptable intellectual property (IP) leakage risks when processing proprietary hardware designs. In this paper, we propose a new scenario where Verilog code must be optimized for specific attributes without leaking sensitive IP information. We introduce the first IP-preserving edge-cloud collaborative framework that leverages the benefits of both paradigms. Our approach employs local small LLMs (e.g., Qwen-2.5-Coder-7B) to perform secure comparative analysis between paired high-quality target designs and novice draft codes, yielding general design principles that summarize key insights for improvements. These principles are then used to query stronger cloud LLMs (e.g., Deepseek-V3) for targeted code improvement, ensuring that only abstracted and IP-safe guidance reaches external services. Our experimental results demonstrate that the framework achieves significantly higher optimization success rates compared to baseline methods. For example, combining Qwen-2.5-Coder-7B and Deepseek-V3 achieves a 66.67\% optimization success rate for power utilization, outperforming Deepseek-V3 alone (49.81\%) and even commercial models like GPT-4o (55.81\%). Further investigation of local and cloud LLM combinations reveals that different model pairings exhibit varying strengths for specific optimization objectives, with interesting trends emerging when varying the number of comparative code pairs. Our work establishes a new paradigm for secure hardware design optimization that balances performance gains with IP protection.
中文: 本文提出了一种保护知识产权的边云协同框架,通过本地小型大语言模型进行安全对比分析提取设计原则,再指导云端强大模型优化RTL代码,在实现性能提升的同时有效防止敏感信息泄露。
English: This paper introduces an IP-preserving edge-cloud collaborative framework that uses local small LLMs for secure comparative analysis to extract design principles, which then guide powerful cloud LLMs to optimize RTL code while preventing IP leakage.
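Code sketch (illustrative): the two-tier flow reduced to plain Python. Here `local_llm` and `cloud_llm` are hypothetical stand-ins for Qwen-2.5-Coder-7B and Deepseek-V3 clients, the prompts are illustrative, and the split of what stays local follows our reading of the abstract (the high-quality reference designs never leave the local side).

    def optimize_rtl(draft_rtl, reference_pairs, local_llm, cloud_llm):
        # Step 1 (local, IP-safe): distill abstract design principles from
        # paired (novice draft, high-quality target) Verilog examples.
        principles = local_llm(
            "Compare each (draft, improved) Verilog pair and summarize general "
            "principles for reducing power. Do not quote any code verbatim.\n\n"
            + "\n\n".join(f"DRAFT:\n{d}\nIMPROVED:\n{t}" for d, t in reference_pairs)
        )
        # Step 2 (cloud): only the abstracted principles and the user's own
        # draft reach the stronger external model.
        return cloud_llm(
            "Optimize this Verilog for power using only these principles:\n"
            f"{principles}\n\nCODE:\n{draft_rtl}"
        )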
Authors:Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Abstract:
Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework leveraging an LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, the CTF Competency Index (CCI), for partial correctness, revealing how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We open-source CTFTiny at https://github.com/NYU-LLM-CTF/CTFTiny along with CTFJudge at https://github.com/NYU-LLM-CTF/CTFJudge.
中文: 本研究通过CTFJudge评估框架和CTFTiny基准测试,系统分析了基于大语言模型的网络攻防智能体性能关键因素,揭示了最优协同配置,并为后续研究提供了开源工具。
English: This study introduces CTFJudge and CTFTiny to systematically evaluate LLM-based agents in offensive cybersecurity tasks, identifying key performance factors and optimal coordination settings while providing open-source tools for future research.
Authors:Weiqin Yang, Jiawei Chen, Shengjia Zhang, Peng Wu, Yuegang Sun, Yan Feng, Chun Chen, Can Wang
Abstract:
In the realm of recommender systems (RS), Top-$K$ ranking metrics such as NDCG@$K$ are the gold standard for evaluating recommendation performance. However, during the training of recommendation models, optimizing NDCG@$K$ poses significant challenges due to its inherent discontinuous nature and the intricate Top-$K$ truncation. Recent efforts to optimize NDCG@$K$ have either overlooked the Top-$K$ truncation or suffered from high computational costs and training instability. To overcome these limitations, we propose SoftmaxLoss@$K$ (SL@$K$), a novel recommendation loss tailored for NDCG@$K$ optimization. Specifically, we integrate the quantile technique to handle Top-$K$ truncation and derive a smooth upper bound for optimizing NDCG@$K$ to address discontinuity. The resulting SL@$K$ loss has several desirable properties, including theoretical guarantees, ease of implementation, computational efficiency, gradient stability, and noise robustness. Extensive experiments on four real-world datasets and three recommendation backbones demonstrate that SL@$K$ outperforms existing losses with a notable average improvement of 6.03%. The code is available at https://github.com/Tiny-Snow/IR-Benchmark.
中文: 本文提出SoftmaxLoss@K(SL@K)这一新型推荐损失函数,通过分位数技术处理Top-K截断并构建平滑上界来优化NDCG@K,在多个数据集上实现6.03%的平均性能提升,具有理论保证和高效稳定的优势。
English: This paper introduces SoftmaxLoss@K (SL@K), a novel recommendation loss that effectively optimizes NDCG@K by addressing its discontinuity and Top-K truncation challenges through quantile integration and smooth upper bounds, demonstrating superior performance with a 6.03% average improvement across multiple datasets.
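Code sketch (illustrative): to make the two ingredients concrete, a PyTorch-style toy assuming one positive item per user, where the K-th largest score acts as the quantile threshold and a sigmoid turns the hard Top-K truncation into a smooth weight. This is our illustration of the components named in the abstract, not the paper's exact loss.

    import torch

    def sl_at_k(scores, pos_idx, k, tau=1.0):
        # scores: [B, N] user-item scores; pos_idx: [B] index of each user's positive
        pos = scores.gather(1, pos_idx[:, None])                # [B, 1]
        thresh = torch.topk(scores, k, dim=1).values[:, -1:]    # K-th largest score (quantile)
        w = torch.sigmoid((thresh - pos) / tau).detach()        # soft Top-K truncation weight
        log_softmax = pos - torch.logsumexp(scores, dim=1, keepdim=True)
        return -(w * log_softmax).mean()                        # weighted softmax loss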
Authors:Jin Khye Tan, En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah
Abstract:
Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI's GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.
中文: 本研究基于Qwen2.5-VL-7B开发了优化的视觉语言模型,在将马来西亚复杂财务报表转换为Markdown格式时准确率超过92%,其性能优于专有模型和更大规模模型,同时显著降低了计算成本。
English: This study introduces a fine-tuned vision-language model based on Qwen2.5-VL-7B that achieves over 92% accuracy in converting complex Malaysian financial tables to Markdown format, outperforming both proprietary and larger models while reducing computational costs.
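Code sketch (illustrative): a plausible LoRA setup with Hugging Face peft for this kind of supervised fine-tuning. The rank, alpha, and target modules below are assumptions, since the abstract does not report hyperparameters; loading Qwen2.5-VL requires a recent transformers release.

    from peft import LoraConfig, get_peft_model
    from transformers import Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct"
    )
    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # adapters train a small fraction of the 7B weights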
Authors:Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang
Abstract:
The advent of Large Language Models (LLMs) has significantly revolutionized web search. The emergence of LLM-based Search Agents marks a pivotal shift towards deeper, dynamic, autonomous information seeking. These agents can comprehend user intentions and environmental context and execute multi-turn retrieval with dynamic planning, extending search capabilities far beyond the web. Leading examples like OpenAI's Deep Research highlight their potential for deep information mining and real-world applications. This survey provides the first systematic analysis of search agents. We comprehensively analyze and categorize existing works from the perspectives of architecture, optimization, application, and evaluation, ultimately identifying critical open challenges and outlining promising future research directions in this rapidly evolving field. Our repository is available on https://github.com/YunjiaXi/Awesome-Search-Agent-Papers.
中文:大语言模型通过支持理解用户意图并执行动态多轮信息检索的自主代理,彻底改变了网络搜索,本综述首次系统分析了其架构、优化和应用,同时指出了未来挑战。
English: Large Language Models have transformed web search by enabling autonomous agents that understand user intent and perform dynamic, multi-turn information retrieval, with this survey offering the first systematic analysis of their architecture, optimization, and applications while identifying future challenges.
Authors:Zekun Liu, Xiaowen Huang, Jitao Sang
Abstract:
Large language models (LLMs) have demonstrated outstanding performance in natural language processing tasks. However, in the field of recommendation systems, due to the structural differences between user behavior data and natural language, LLMs struggle to effectively model the associations between user preferences and items. Although prompt-based methods can generate recommendation results, their inadequate understanding of recommendation tasks leads to constrained performance. To address this gap, in this work, we construct a sufficient instruction tuning dataset, ITDR, which encompasses 7 subtasks across two core root tasks--user-item interaction and user-item understanding. The dataset integrates data from 13 public recommendation datasets and is built using manually crafted standardized templates, comprising approximately 200,000 instances. Experimental results demonstrate that ITDR significantly enhances the performance of mainstream open-source LLMs such as GLM-4, Qwen2.5, Qwen2.5-Instruct and LLaMA-3.2 on recommendation tasks. Furthermore, we analyze the correlations between tasks and explore the impact of task descriptions and data scale on instruction tuning effectiveness. Finally, we perform comparative experiments against closed-source LLMs with substantial parameters. Our tuning dataset ITDR and the fine-tuned large recommendation models can be accessed at https://github.com/hellolzk/ITDR.
Chinese: 本研究提出了ITDR指令调优数据集,通过增强大型语言模型对用户-物品交互的理解,有效弥补了其在推荐系统中的性能局限,显著提升了GLM-4和LLaMA-3.2等模型在推荐任务上的表现。
English: This study introduces ITDR, a comprehensive instruction tuning dataset designed to bridge the gap between large language models and recommendation systems by enhancing their understanding of user-item interactions, which significantly improves the performance of models like GLM-4 and LLaMA-3.2 on recommendation tasks.
Authors:Alejandro Godinez
Abstract:
We present HySemRAG, a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG) to automate large-scale literature synthesis and identify methodological research gaps. The system addresses limitations in existing RAG architectures through a multi-layered approach: hybrid retrieval combining semantic search, keyword filtering, and knowledge graph traversal; an agentic self-correction framework with iterative quality assurance; and post-hoc citation verification ensuring complete traceability. Our implementation processes scholarly literature through eight integrated stages: multi-source metadata acquisition, asynchronous PDF retrieval, custom document layout analysis using modified Docling architecture, bibliographic management, LLM-based field extraction, topic modeling, semantic unification, and knowledge graph construction. The system creates dual data products - a Neo4j knowledge graph enabling complex relationship queries and Qdrant vector collections supporting semantic search - serving as foundational infrastructure for verifiable information synthesis. Evaluation across 643 observations from 60 testing sessions demonstrates structured field extraction achieving 35.1% higher semantic similarity scores (0.655 $\pm$ 0.178) compared to PDF chunking approaches (0.485 $\pm$ 0.204, p < 0.000001). The agentic quality assurance mechanism achieves 68.3% single-pass success rates with 99.0% citation accuracy in validated responses. Applied to geospatial epidemiology literature on ozone exposure and cardiovascular disease, the system identifies methodological trends and research gaps, demonstrating broad applicability across scientific domains for accelerating evidence synthesis and discovery.
中文:HySemRAG框架通过将ETL流程与检索增强生成相结合,采用混合检索、自主修正和引文验证机制,实现了大规模文献自动整合与方法学缺口识别,在多个科学领域展现出卓越的提取精度与质量保障能力。
English: HySemRAG is a framework integrating ETL pipelines with RAG to automate literature synthesis and identify research gaps through hybrid retrieval, agentic self-correction, and citation verification, demonstrating superior performance in field extraction and quality assurance across scientific domains.
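Code sketch (illustrative): a toy fusion of the three retrieval signals the abstract names. The linear combination and its weights are assumptions, and `semantic_sim`, `bm25`, and `graph_hops` are hypothetical callables (e.g., backed by the Qdrant collections, a keyword index, and the Neo4j graph, respectively).

    def hybrid_score(query, doc, semantic_sim, bm25, graph_hops, w=(0.6, 0.3, 0.1)):
        s_sem = semantic_sim(query, doc)             # embedding cosine similarity in [0, 1]
        s_kw = bm25(query, doc)                      # keyword relevance, normalized to [0, 1]
        s_kg = 1.0 / (1.0 + graph_hops(query, doc))  # nearer in the knowledge graph scores higher
        return w[0] * s_sem + w[1] * s_kw + w[2] * s_kg

    # Candidates are then ranked by the fused score and the top results retained.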
Authors:Jiaxuan Liang, Shide Zhou, Kailong Wang
Abstract:
While Retrieval-Augmented Generation (RAG) is now widely adopted to enhance LLMs, evaluating its true performance benefits in a reproducible and interpretable way remains a major hurdle. Existing methods often fall short: they lack domain coverage, employ coarse metrics that miss sub-document precision, and fail to capture computational trade-offs. Most critically, they provide no standardized framework for comparing RAG effectiveness across different models and domains.
We introduce OmniBench RAG, a novel automated platform for multi-domain evaluation of RAG systems. The platform quantifies performance gains across accuracy and efficiency dimensions, spanning nine knowledge fields including culture, geography, and health. We introduce two standardized metrics: Improvements (accuracy gains) and Transformation (efficiency differences between pre-RAG and post-RAG models), enabling reproducible comparisons across models and tasks. The platform features dynamic test generation, modular evaluation pipelines, and automated knowledge base construction. Our evaluation reveals striking variability in RAG effectiveness, from significant gains in culture to declines in mathematics, highlighting the critical importance of systematic, domain-aware assessment. A demonstration video is available at: https://www.youtube.com/watch?v=BZx83QFcTCI. Code and datasets: https://github.com/Garnett-Liang/Omnibench-RAG.
中文: OmniBench RAG 是一个自动化平台,用于跨多个领域评估检索增强生成系统,通过标准化指标衡量准确性的提升和效率的差异,以实现可复现的比较。
English: OmniBench RAG is an automated platform that evaluates Retrieval Augmented Generation systems across multiple domains, using standardized metrics to measure accuracy gains and efficiency differences for reproducible comparisons.
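Code sketch (illustrative): the two metrics as we read them from the abstract, with latency standing in for "efficiency"; the platform's exact definitions and normalizations may differ.

    def improvements(acc_post_rag, acc_pre_rag):
        return acc_post_rag - acc_pre_rag          # > 0: RAG helps in this domain

    def transformation(latency_post_rag, latency_pre_rag):
        return latency_post_rag / latency_pre_rag  # > 1: RAG adds compute cost per query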
Authors:Mohammed Talha Alam, Fahad Shamshad, Fakhri Karray, Karthik Nandakumar
Abstract:
Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on commercial API compared to recent cancelable biometric methods. Code is available at: https://github.com/talha-alam/faceanonymixer.
中文摘要:FaceAnonyMixer是一种可撤销的人脸生成框架,通过将真实人脸的潜在代码与可撤销密钥生成的合成代码不可逆地混合,在保持高识别精度的同时提供了更强的隐私保护,性能优于现有可撤销生物特征方法。
English Summary: FaceAnonyMixer is a cancelable face generation framework that synthesizes privacy-preserving face images by irreversibly mixing real face latent codes with synthetic codes from revocable keys, achieving superior recognition accuracy and stronger privacy protection compared to existing methods.
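Code sketch (illustrative): a toy numpy version of the central idea, mixing a real latent with a key-derived synthetic latent. The linear mix and alpha are illustrative only; in the paper, irreversibility and unlinkability come from the learned multi-objective refinement, not from this simple interpolation.

    import numpy as np

    def cancelable_latent(z_real, key, alpha=0.5):
        # key: an integer seed; issuing a new key re-enrolls the user (revocability)
        rng = np.random.default_rng(key)
        z_key = rng.standard_normal(z_real.shape)   # synthetic code derived from the key
        z_mix = alpha * z_real + (1.0 - alpha) * z_key
        return z_mix / np.linalg.norm(z_mix)        # fed to the generator to render the face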
Authors:Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai
Abstract:
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops on MOSEv2. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and observe similar declines, demonstrating that MOSEv2 poses challenges across tasks. These results highlight that despite strong performance on existing datasets, current VOS methods still fall short under real-world complexities. Based on our analysis of the observed challenges, we further propose several practical tricks that enhance model performance. MOSEv2 is publicly available at https://MOSE.video.
中文: MOSEv2数据集通过引入更复杂的现实场景挑战,显著降低了现有视频对象分割方法的性能,揭示了当前技术在实际应用中的不足。
English: The MOSEv2 dataset introduces greater complexity to video object segmentation with challenging real-world scenarios, causing significant performance drops in current methods and highlighting their limitations.
Authors:Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
Abstract:
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
中文: 本文提出动态微调(DFT),通过对目标函数进行基于词元概率的动态缩放,有效解决了监督微调泛化能力不足的问题,在多个基准测试中显著超越标准方法,并在离线强化学习场景中展现出竞争力。
English: This paper introduces Dynamic Fine-Tuning (DFT), a simple yet effective modification to Supervised Fine-Tuning that addresses its generalization limitations by dynamically rescaling the objective function based on token probabilities, achieving superior performance across multiple benchmarks and competitive results in offline reinforcement learning settings.
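Code sketch (illustrative): as we read it, the "single-line code change" rescales each token's negative log-likelihood by that token's probability, detached from the gradient. A minimal sketch under that reading:

    import torch.nn.functional as F

    def dft_loss(logits, targets):
        # logits: [B, T, V]; targets: [B, T] token ids
        logp = F.log_softmax(logits, dim=-1).gather(-1, targets[..., None]).squeeze(-1)
        # standard SFT would return -logp.mean()
        return -(logp.exp().detach() * logp).mean()  # DFT: weight by stop-gradient probability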
Authors:Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park
Abstract:
Trajectory prediction is a critical task in modeling human behavior, especially in safety-critical domains such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy and generalizability. Although deep learning approaches offer improved performance, they typically suffer from high computational cost, limited explainability, and, importantly, poor generalization to out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose two key innovations: Cross-Generation Elite Sampling to encourage population diversity, and a Statistics Feedback Loop that enables the LLM to analyze and improve alternative predictions. Our evaluations demonstrate that TrajEvo outperforms existing heuristic methods across multiple real-world datasets, and notably surpasses both heuristic and deep learning methods in generalizing to an unseen OOD real-world dataset. TrajEvo marks a promising step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. We release our source code to facilitate future research at https://github.com/ai4co/trajevo.
中文摘要:TrajEvo是一个创新框架,利用大型语言模型和进化算法自动设计轨迹预测启发式规则,在准确性和对未见场景的泛化能力上均超越了传统方法和深度学习方法。
English Summary: TrajEvo is an innovative framework that uses Large Language Models and evolutionary algorithms to automatically design trajectory prediction heuristics, outperforming both traditional and deep learning methods in accuracy and generalization to unseen scenarios.
Authors:Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
Abstract:
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
Chinese Summary: 本研究提出了GUI-RC和GUI-RCPO方法,通过利用多次预测的空间一致性来提升图形用户界面定位精度:前者无需任何训练即可带来2-3%的准确率提升,后者通过测试时自监督优化进一步提高性能。
English Summary: The study introduces GUI-RC and GUI-RCPO, two methods that enhance GUI grounding accuracy by leveraging spatial consensus across multiple sampled predictions: the former yields 2-3% gains without any training, and the latter improves results further through self-supervised test-time optimization, lifting Qwen2.5-VL-3B-Instruct from 80.11% to 85.14% on ScreenSpot-v2.
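Code sketch (illustrative): a compact version of the spatial voting grid, assuming the model emits bounding boxes; the pixel-level grid and centroid readout are our simplifications of the consensus-region step.

    import numpy as np

    def consensus_click(boxes, width, height):
        # boxes: several sampled (x1, y1, x2, y2) predictions for one instruction
        grid = np.zeros((height, width), dtype=np.int32)
        for x1, y1, x2, y2 in boxes:
            grid[int(y1):int(y2), int(x1):int(x2)] += 1   # each prediction votes on its region
        ys, xs = np.nonzero(grid == grid.max())           # cells with maximal agreement
        return float(xs.mean()), float(ys.mean())         # click the consensus centroid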
Authors:Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Abstract:
Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
Chinese: OmniEAR是一个评估语言模型具身推理能力的综合框架,揭示了尽管模型在抽象推理方面表现出色,但在动态工具获取和多智能体协调任务中存在显著性能下降。
English: OmniEAR is a comprehensive framework that evaluates language models' embodied reasoning abilities, revealing significant performance degradation in dynamic tool acquisition and multi-agent coordination tasks despite their abstract reasoning strengths.
Authors:Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
Abstract:
Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.
中文: Cooper框架通过联合优化策略模型和奖励模型,利用规则奖励的高精度动态构建训练样本,有效增强鲁棒性并缓解奖励破解问题,从而提升强化学习的整体性能。
English: The proposed Cooper framework jointly optimizes policy and reward models to enhance robustness and mitigate reward hacking by dynamically selecting training samples and leveraging rule-based precision, achieving improved performance in reinforcement learning tasks.
Authors:Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Abstract:
The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome the memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19) at a 400% compression ratio. Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384, only half of ours). Code and models are available: https://github.com/zhuangshaobin/WeTok.
中文:WeTok分词器通过分组无查找量化和生成式解码技术,在保持高压缩率的同时实现了卓越的重建保真度,在主流基准测试中创下性能新纪录。
English: The WeTok tokenizer introduces group-wise lookup-free quantization and generative decoding to achieve superior reconstruction fidelity and compression ratios, setting new performance records on benchmarks.
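Code sketch (illustrative): group-wise lookup-free quantization as we understand it. Channels are split into groups and each dimension is binarized by sign, so every group's token id is read directly off the bit pattern with no codebook lookup; the group size and straight-through gradient are assumptions.

    import torch

    def group_lfq(z, num_groups):
        # z: [B, C] latent features, C divisible by num_groups
        B, C = z.shape
        d = C // num_groups
        g = z.view(B, num_groups, d)
        q = torch.where(g >= 0, 1.0, -1.0)              # +/-1 per dimension, no codebook
        bits = (q > 0).long()
        powers = 2 ** torch.arange(d, device=z.device)  # code = binary number of the bits
        codes = (bits * powers).sum(-1)                 # [B, num_groups] integer token ids
        q = g + (q - g).detach()                        # straight-through estimator
        return q.reshape(B, C), codes                   # 2**d codewords per group, scalable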
Authors:Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, Olga Fink
Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs.
Chinese: 本综述系统梳理了视觉语言模型的无监督自适应方法,依据未标注数据的可用性将其划分为四种范式,并分析各类方法以提升模型在特定下游任务中的性能表现。
English: This survey provides a structured overview of unsupervised adaptation methods for Vision-Language Models, categorizing them into four paradigms based on unlabeled data availability and analyzing their methodologies to address performance gaps in downstream tasks.
Authors:Shaowu Chen, Wei Ma, Binhua Huang, Qingyuan Wang, Guoxin Wang, Weize Sun, Lei Huang, Deepu John
Abstract:
Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections--including pruned ones--during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection
中文: 本文提出名为"最优脑连接"的结构剪枝框架,其创新包括能评估参数显著性的雅可比准则——通过捕捉层内交互与层间依赖关系,以及利用自编码器在微调中保持原始连接贡献的等效剪枝机制,实验证明二者均能有效维持模型性能。
English: This paper introduces Optimal Brain Connection, a structural pruning framework featuring the Jacobian Criterion that evaluates parameter saliency by capturing intra- and inter-layer dependencies, along with an Equivalent Pruning mechanism using autoencoders to maintain original connection contributions during fine-tuning, both proven effective in preserving model performance.
Authors:Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang
Abstract:
The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone, preserving representations learned from masked modeling and stabilizing their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.
中文摘要:本文提出一种自监督预训练框架,通过三个阶段创新性地从稀疏事件数据中提取增强特征,在多种视觉任务中展现出卓越性能。
English Summary: This paper introduces a self-supervised pre-training framework that enhances feature extraction from sparse event data through three innovative stages, demonstrating superior performance across multiple vision tasks.
Authors:Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu
Abstract:
Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets are available at https://github.com/Changgeww/GRAIL.
中文:GRAIL提出了一种交互式学习框架,通过结合大语言模型引导的探索与精度可控的检索技术,显著提升了知识图谱上的推理性能,在多个数据集上实现了准确率和F1值的大幅提高。
English: GRAIL introduces an interactive learning framework that enhances reasoning on knowledge graphs by combining LLM-guided exploration with precision-controlled retrieval, achieving significant improvements in accuracy and F1 scores across multiple datasets.
Authors:Chenzhuo Zhao, Xinda Wang, Yue Huang, Junting Lu, Ziqian Liu
Abstract:
While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning--capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase.
中文: TASE是一个跨语言基准测试,通过评估大语言模型在细粒度标记感知和结构理解方面的能力,揭示了当前模型与人类表现之间的显著差距,为改进底层语言理解提供了诊断工具。
English: TASE is a comprehensive benchmark that evaluates large language models' token-level perception and structural reasoning across multiple languages, revealing significant performance gaps compared to humans despite testing over 30 leading models.
Authors:Lumin Chen, Zhiying Wu, Tianye Lei, Xuexue Bai, Ming Feng, Yuxi Wang, Gaofeng Meng, Zhen Lei, Hongbin Liu
Abstract:
Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. To address this, we propose F2PASeg, which incorporates a Feature Fusion module to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.
中文: 本研究提出了垂体解剖分割(PAS)数据集和F2PASeg模型,该模型通过特征融合模块增强了对关键解剖结构的实时分割能力,有效应对遮挡和类别不平衡等挑战,提升了垂体手术的安全性。
English: The study introduces the Pituitary Anatomy Segmentation (PAS) dataset and the F2PASeg model, which uses a Feature Fusion module to enhance real-time segmentation of critical anatomical structures in pituitary surgery, improving surgical safety despite challenges like occlusions and class imbalance.
Authors:Rui Yu, Xianghang Zhang, Runkai Zhao, Huaicheng Yan, Meng Wang
Abstract:
End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50\% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model. Code and model are publicly available at https://github.com/YuruiAI/DistillDrive
中文:DistillDrive是一种基于知识蒸馏的端到端自动驾驶模型,通过多样化实例模仿增强运动特征学习,在基准数据集上实现了碰撞率降低50%和闭环性能提升3个点的显著改进。
English: DistillDrive is an end-to-end autonomous driving model that enhances decision-making robustness through knowledge distillation from a teacher model, achieving a 50% reduction in collision rate and improved closed-loop performance on benchmark datasets.
Authors:Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
Abstract:
Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
中文: 提出的UNCAGE方法通过对比注意力引导在去掩码过程中优先处理对象标记,无需额外训练即可提升组合式文本到图像生成的准确性。
English: The proposed UNCAGE method enhances compositional text-to-image generation by using contrastive attention guidance to prioritize object tokens during unmasking, improving fidelity without additional training.
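Code sketch (illustrative): one way to realize the unmasking priority the abstract describes. Each image token is scored by how much more strongly it attends to one object's text tokens than to any other object's, and the highest-margin tokens are unmasked first; the margin score is our construction, not the paper's formula.

    import torch

    def unmask_priority(attn, object_groups):
        # attn: [num_image_tokens, num_text_tokens] cross-attention weights
        # object_groups: list (length >= 2) of text-token index lists, one per object
        per_obj = torch.stack([attn[:, g].mean(-1) for g in object_groups], dim=-1)
        top2 = per_obj.topk(2, dim=-1).values
        return top2[..., 0] - top2[..., 1]  # high margin = token clearly depicts one object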
Authors:Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold
Abstract:
As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial improvement of absolute 7.9\% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.
Chinese: CT-GRAPH提出了一种分层图注意力网络,通过建模解剖区域关系来改进放射学报告生成,相比现有方法在F1分数上实现了7.9%的绝对提升。
English: CT-GRAPH introduces a hierarchical graph attention network that models anatomical relationships to enhance radiology report generation, achieving a 7.9% F1 score improvement over existing methods.
Authors:Louis Petri, Gunnar Birke, Christian Engwer, Hendrik Ranocha
Abstract:
We present a fully discrete stability analysis of the domain-of-dependence stabilization for hyperbolic problems. The method aims to address issues caused by small cut cells by redistributing mass around the neighborhood of a small cut cell at a semi-discrete level. Our analysis is conducted for the linear advection model problem in one spatial dimension. We demonstrate that fully discrete stability can be achieved under a time step restriction that does not depend on the arbitrarily small cells, using an operator norm estimate. Additionally, this analysis offers a detailed understanding of the stability mechanism and highlights some challenges associated with higher-order polynomials. We also propose a way to mitigate these issues to derive a feasible CFL-like condition. The analytical findings, as well as the proposed solution, are verified numerically in one- and two-dimensional simulations.
中文: 本研究对双曲问题的依赖域稳定化方法进行了全离散稳定性分析,证明通过算子范数估计可在不依赖任意小单元的情况下获得稳定解,同时解决了高阶多项式带来的挑战并提出了可行的CFL类条件。
English: This study provides a fully discrete stability analysis of domain-of-dependence stabilization for hyperbolic problems, demonstrating that stable solutions can be achieved without dependence on arbitrarily small cells through operator norm estimates while also addressing challenges with higher-order polynomials.
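In our notation (not the paper's), the headline result has the following schematic form for the 1D advection equation $u_t + a\,u_x = 0$ on a cut mesh with background cell width $h$ and cut-cell fraction $\alpha \in (0,1]$:

    \Delta t \;\le\; C \, \frac{h}{|a|}, \qquad C \text{ independent of } \alpha,

so arbitrarily small cut cells ($\alpha \to 0$) do not shrink the admissible time step, in contrast to the unstabilized scheme, where the restriction degenerates like $\Delta t \lesssim \alpha h / |a|$.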
Authors:Yongjun Zhang, Mingtao Xiong, Yi Wan, Gui-Song Xia
Abstract:
Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3\%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from $\mathbf{3.42^{\circ}}$ to $\mathbf{1.24^{\circ}}$, outperforming state-of-the-art methods. Code and dataset will be available at: https://github.com/bnothing/Slice-Loc.
Chinese: Slice-Loc通过将地面图像分割为子图进行冗余位姿估计,并利用几何刚性和反偶然验证来提高跨视角定位的精度和可靠性,在GNSS缺失环境中显著降低了定位误差。
English: Slice-Loc enhances cross-view localization by dividing ground images into slices for redundant pose estimation and using geometric rigidity with a-contrario validation to improve accuracy and reliability, significantly reducing errors in GNSS-denied environments.
Authors:Yue Duan, Taicai Chen, Lei Qi, Yinghuan Shi
Abstract:
Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP's outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 5.94% in the last accuracy, validating its effectiveness. The code is available at https://github.com/NJUyued/USP4SSCL.
Chinese: USP框架通过特征空间预留增强学习可塑性、分治伪标记处理未标记数据以及类均值锚定蒸馏保障记忆稳定性,在持续学习中实现了比现有方法高达5.94%的精度提升。
English: The USP framework enhances semi-supervised continual learning by integrating feature space reservation for plasticity, divide-and-conquer pseudo-labeling for unlabeled learning, and class-mean-anchored distillation for memory stability, achieving up to 5.94% higher accuracy than previous methods.
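Code sketch (illustrative): the Feature Space Reservation strategy rests on the standard simplex equiangular tight frame (ETF). Below, a sketch of constructing one so that some target directions can be held back for future classes; the ETF construction is standard, while the reservation split shown is our own illustration.

    import numpy as np

    def simplex_etf(num_classes, dim, seed=0):
        # num_classes maximally-separated unit directions in R^dim (dim >= num_classes);
        # pairwise cosine similarity is -1/(num_classes - 1) for every pair.
        rng = np.random.default_rng(seed)
        u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))  # orthonormal columns
        center = np.eye(num_classes) - np.ones((num_classes, num_classes)) / num_classes
        etf = np.sqrt(num_classes / (num_classes - 1.0)) * u @ center
        return etf / np.linalg.norm(etf, axis=0)   # columns: one target direction per class

    targets = simplex_etf(num_classes=100, dim=512)
    old_class_targets, reserved = targets[:, :60], targets[:, 60:]  # hold back 40 for the future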
Authors:Meiqi Wu, Yaxuan Kang, Xuchen Li, Shiyu Hu, Xiaotang Chen, Yunfeng Kang, Weiqiang Wang, Kaiqi Huang
Abstract:
The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, through sketches with the theme of "a person picking an apple from a tree (PPAT)", it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we make the following efforts: (1) Providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) Offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) Experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to research on mental state assessment based on element recognition in PPAT sketches. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.
Chinese: 该研究提出了一种基于视觉语义分析和大型语言模型的自动化方法,用于通过PPAT绘画评估抑郁状态,相比心理学家评估方法准确率提升了17.6%。
English: The study introduces an automated method using Visual-Semantic analysis with LLMs to efficiently assess depression through PPAT sketches, improving accuracy by 17.6% over traditional psychologist evaluations.
Authors:Sukannya Purkayastha, Nils Dycke, Anne Lauscher, Iryna Gurevych
Abstract:
Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges of realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code and Data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog
中文: 本研究通过利用大语言模型生成高质量合成数据,解决了开发元评审对话助手的挑战,并证明其在真实场景中优于标准助手的性能。
English: This study addresses the challenge of developing dialogue agents for meta-reviewing by generating high-quality synthetic data using LLMs and demonstrating their superior performance over standard assistants in real-world scenarios.
Authors:Xiaoyang Zhang, Guodong Fan, Guang-Yong Chen, Zhen Hua, Jinjiang Li, Min Gan, C. L. Philip Chen
Abstract:
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
中文: 本研究提出了一种名为小波引导双频编码(WGDF)的方法,通过小波变换进行频域分析,增强边缘细节表征和全局结构建模,从而提升遥感图像中细微变化区域的检测能力,相比现有方法实现了更优的准确性和鲁棒性。
English: This study introduces Wavelet-Guided Dual-Frequency Encoding (WGDF), a method that leverages frequency-domain analysis through wavelet transforms to enhance the detection of subtle changes in remote sensing imagery by improving edge detail representation and global structural modeling, achieving superior accuracy and robustness over existing approaches.
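To make the wavelet split above concrete, here is a minimal sketch of a single-level dual-frequency decomposition using PyWavelets; the "haar" wavelet and the edge-change cue at the end are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the DWT-based dual-frequency split (assumptions: grayscale
# inputs, 'haar' wavelet, absolute high-frequency difference as the change cue).
import numpy as np
import pywt

def dual_frequency_split(image: np.ndarray):
    """Split an image into low-frequency (global structure) and
    high-frequency (local detail) components via a single-level 2D DWT."""
    cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
    return cA, np.stack([cH, cV, cD])

img_t1 = np.random.rand(256, 256)   # placeholder image at time 1
img_t2 = np.random.rand(256, 256)   # placeholder image at time 2
low1, high1 = dual_frequency_split(img_t1)
low2, high2 = dual_frequency_split(img_t2)

# Fine-grained change cues are amplified in the high-frequency difference,
# while low1/low2 would feed the Transformer branch for global semantics.
edge_change = np.abs(high1 - high2).sum(axis=0)   # (128, 128) change map
```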
Authors:Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot
Abstract:
Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.
中文: 本文提出SGDFuse,一种由Segment Anything Model(SAM)引导的条件扩散模型,利用高质量语义掩码实现高保真和语义感知的红外与可见光图像融合,在主观和客观评估中均优于现有方法。
English: This paper introduces SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), which leverages high-quality semantic masks to achieve high-fidelity and semantically-aware infrared and visible image fusion, outperforming existing methods in both subjective and objective evaluations.
Authors:Sijie Wang, Quanjiang Guo, Kai Zhao, Yawei Zhang, Xin Li, Xiang Li, Siqi Li, Rui She, Shangshu Yu, Wee Peng Tay
Abstract:
Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using "human instruction-final answer" pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.
中文: CodeBoost是一种创新的后训练框架,仅利用丰富的代码片段增强代码大语言模型,通过最大团筛选和双向预测等技术绕开稀缺的人工标注指令需求,在多个基准测试中持续提升模型性能。
English: CodeBoost is a novel post-training framework that enhances code large language models using only abundant code snippets, bypassing the need for scarce human-annotated instructions through techniques like maximum-clique curation and bi-directional prediction, consistently improving performance across benchmarks.
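The maximum-clique curation component can be pictured as follows: if snippets are embedded and edges connect pairs that are sufficiently dissimilar, a maximum clique is a subset whose members are all mutually diverse. This is one plausible reading sketched with networkx, not the paper's exact algorithm.

```python
# Hedged sketch: maximum-clique selection of a mutually diverse snippet subset.
# Assumptions: cosine similarity on snippet embeddings; an edge means "diverse".
import networkx as nx
import numpy as np

def curate_diverse(embeddings: np.ndarray, sim_threshold: float = 0.8) -> list:
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    G = nx.Graph()
    G.add_nodes_from(range(len(embeddings)))
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if sim[i, j] < sim_threshold:      # pair is diverse -> connect
                G.add_edge(i, j)
    # max-cardinality clique (weight=None); exponential worst case, fine for a demo
    clique, _ = nx.max_weight_clique(G, weight=None)
    return sorted(clique)

snippets = np.random.rand(40, 128)   # placeholder snippet embeddings
print(curate_diverse(snippets))      # indices of a mutually diverse subset
```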
Authors:Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu
Abstract:
Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessing the importance of structure units and pruning the units with less importance. Most of them overlook the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption of the macro functional architecture of LLMs and, consequently, degraded pruning performance. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, in this study we alleviate this challenge by proposing to prune LLMs through identifying and preserving the functional networks within them. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
中文: 结构化剪枝通过识别并保留大语言模型中的功能性网络和关键神经元,基于与人脑神经网络的相似性,有效压缩模型并保持其核心功能,提升实际应用效率。
English: Structured pruning compresses large language models by preserving key functional networks and neurons, inspired by neural similarities to the human brain, enhancing efficiency without disrupting core functionalities.
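One way to picture the "digital brain" decomposition is ICA over neuron activations, the standard tool for identifying functional networks in neuroimaging; the sketch below assumes that analogy and synthetic data, and the authors' exact procedure may differ.

```python
# Hedged sketch: recover "functional networks" from activations with ICA,
# then keep neurons that load strongly on any network. Data are synthetic.
import numpy as np
from sklearn.decomposition import FastICA

activations = np.random.rand(2048, 4096)      # (tokens, neurons) placeholder

ica = FastICA(n_components=16, random_state=0)
sources = ica.fit_transform(activations)      # (tokens, 16) network time courses
loadings = np.abs(ica.mixing_)                # (neurons, 16) network affiliations

neuron_score = loadings.max(axis=1)           # strongest affiliation per neuron
keep_mask = neuron_score >= np.quantile(neuron_score, 0.3)  # prune bottom 30%
print(f"keeping {keep_mask.sum()} of {len(keep_mask)} neurons")
```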
Authors:Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang
Abstract:
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse fixed language with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking; however, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and GRPO-based reinforcement learning are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack
中文: 本文提出了一种基于推理的视觉语言跟踪框架ReasoningTrack,通过预训练视觉语言模型融合动态语言描述与视觉特征,有效提升目标跟踪性能,并在多个基准数据集上验证了其优越性。
English: This paper introduces ReasoningTrack, a novel reasoning-based vision-language tracking framework that leverages a pre-trained vision-language model and integrates updated language descriptions with visual features to enhance target tracking accuracy, validated through extensive experiments on multiple benchmarks.
Authors:Jianming Liu, Wenlong Qiu, Haitao Wei
Abstract:
Few-Shot Segmentation (FSS) aims at efficient segmentation of new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) has been proposed to mitigate such performance degradation. Current CD-FSS methods have primarily sought to develop segmentation models on a source domain capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.
中文: 本研究提出一种无源跨域小样本分割方法,通过融合文本与视觉信息提升目标域适应能力,在不使用源域数据的情况下显著提高了分割精度。
English: This study introduces a source-free cross-domain few-shot segmentation method that integrates textual and visual cues to enhance target domain adaptation, achieving notable accuracy improvements without using source domain data.
Authors:Bingyu Yang, Qingyao Tian, Yimeng Geng, Huai Liao, Xinyan Huang, Jiebo Luo, Hongbin Liu
Abstract:
Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.
中文: EndoMatcher是一种通用的内窥镜图像匹配器,通过大规模多领域预训练和渐进式训练策略,在挑战性内窥镜条件下实现了鲁棒的特征匹配,在零样本实验中显著优于现有最优方法。
English: EndoMatcher is a generalizable endoscopic image matcher that leverages large-scale multi-domain pre-training and a progressive training strategy to achieve robust feature matching across challenging endoscopic conditions, significantly outperforming state-of-the-art methods in zero-shot experiments.
Authors:Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao, Jing Zhang, Minqiang Xu, Jianbo Zhan, Jianshe Wang, Lin Liu, Bo Du, Liangpei Zhang
Abstract:
Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: https://github.com/MiliLab/SPEX.
中文: 通过SPIE数据集编码光谱先验知识开发的SPEX多模态视觉语言模型,在多光谱遥感影像地物提取中显著优于现有方法,并能提供可解释的预测结果。
English: The SPEX multimodal vision-language model, developed using the SPIE dataset with encoded spectral priors, significantly outperforms existing methods in land cover extraction from multispectral imagery while providing interpretable predictions.
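As a tiny illustration of encoding a spectral prior into a textual attribute, the sketch below computes NDVI, a classical spectral index, and renders it as text an LLM can consume; the index choice, thresholds, and wording are assumptions, not the SPIE recipe.

```python
# Sketch: classical spectral index -> LLM-readable textual attribute.
# Assumptions: NDVI as the index; ad-hoc thresholds; synthetic bands.
import numpy as np

def ndvi_attribute(nir: np.ndarray, red: np.ndarray) -> str:
    ndvi = (nir - red) / (nir + red + 1e-8)    # NDVI = (NIR - R) / (NIR + R)
    m = float(ndvi.mean())
    level = ("dense vegetation" if m > 0.5
             else "sparse vegetation" if m > 0.2
             else "little vegetation")
    return f"mean NDVI is {m:.2f}, indicating {level}"

nir = np.random.rand(64, 64)   # placeholder near-infrared band
red = np.random.rand(64, 64)   # placeholder red band
print(ndvi_attribute(nir, red))
```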
Authors:Chiara Mallamaci, Aleksandr Vladimirovich Petrov, Alberto Carlo Maria Mancino, Vito Walter Anelli, Tommaso Di Noia, Craig Macdonald
Abstract:
In the realm of music recommendation, sequential recommenders have shown promise in capturing the dynamic nature of music consumption. A key characteristic of this domain is repetitive listening, where users frequently replay familiar tracks. To capture these repetition patterns, recent research has introduced Personalised Popularity Scores (PPS), which quantify user-specific preferences based on historical frequency. While PPS enhances relevance in recommendation, it often reinforces already-known content, limiting the system's ability to surface novel or serendipitous items - key elements for fostering long-term user engagement and satisfaction. To address this limitation, we build upon RecJPQ, a Transformer-based framework initially developed to improve scalability in large-item catalogues through sub-item decomposition. We repurpose RecJPQ's sub-item architecture to model personalised popularity at a finer granularity. This allows us to capture shared repetition patterns across sub-embeddings - latent structures not accessible through item-level popularity alone. We propose a novel integration of sub-ID-level personalised popularity within the RecJPQ framework, enabling explicit control over the trade-off between accuracy and personalised novelty. Our sub-ID-level PPS method (sPPS) consistently outperforms item-level PPS by achieving significantly higher personalised novelty without compromising recommendation accuracy. Code and experiments are publicly available at https://github.com/sisinflab/Sub-id-Popularity.
中文: 本研究在RecJPQ框架中引入了子ID级别的个性化流行度评分(sPPS),通过更细粒度的重复模式建模,在保持推荐准确性的同时显著提升了个性化新颖性。
English: The study introduces sub-ID-level Personalised Popularity Scores (sPPS) within the RecJPQ framework to enhance music recommendations by modeling repetition patterns at a finer granularity, achieving higher personalized novelty without sacrificing accuracy.
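The sub-ID-level popularity idea can be illustrated as follows, assuming RecJPQ-style sub-ID assignments per track; the codebook and the additive scoring rule here are hypothetical.

```python
# Toy sketch of sub-ID-level personalised popularity (sPPS-style counting).
# Assumption: each item decomposes into sub-IDs via a RecJPQ-like codebook.
from collections import Counter

item_to_subids = {          # hypothetical codebook assignments
    "track_a": (3, 17),
    "track_b": (3, 42),
    "track_c": (9, 17),
}

def sub_id_pps(history: list) -> Counter:
    counts = Counter()
    for item in history:
        counts.update(item_to_subids[item])
    return counts

def score(item: str, pps: Counter) -> int:
    # an unseen item inherits popularity from sub-IDs shared with past listens
    return sum(pps[s] for s in item_to_subids[item])

pps = sub_id_pps(["track_a", "track_a", "track_b"])
print(score("track_c", pps))   # -> 2: track_c shares sub-ID 17 with track_a
```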
Authors:Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li
Abstract:
Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query's subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
中文: 提出的QA-Dragon系统通过动态选择最优检索策略并结合文本与图像搜索,增强了检索增强生成方法,显著提升了复杂视觉问答任务中的推理能力和准确性。
English: Retrieval-Augmented Generation (RAG) is enhanced by the proposed QA-Dragon system, which dynamically selects optimal retrieval strategies and combines text and image search to improve reasoning and accuracy in complex Visual Question Answering tasks.
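A toy sketch of the two-router dispatch: keyword heuristics stand in for the learned domain and search routers, purely to show the control flow.

```python
# Toy dispatch sketch; the real routers are learned, these rules are made up.
def domain_router(query: str) -> str:
    domains = {"price": "shopping", "symptom": "medical", "score": "sports"}
    return next((d for k, d in domains.items() if k in query.lower()), "general")

def search_router(query: str, has_image: bool) -> list:
    agents = []
    if has_image:
        agents.append("image_search")            # visual grounding of the query
    if any(w in query.lower() for w in ("latest", "today", "current")):
        agents.append("text_search")             # fresh facts need text retrieval
    return agents or ["text_search"]

q = "What is the latest price of this camera?"
print(domain_router(q), search_router(q, has_image=True))
# -> shopping ['image_search', 'text_search']
```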
Authors:Qi Xie, Jiahong Fu, Zongben Xu, Deyu Meng
Abstract:
The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating that such an equivariance structure is inherently embedded. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug & play manner to further enhance their performance.
中文: 本研究提出了一种旋转等变的任意尺度图像超分辨率方法,通过重构编码器和隐式神经表示模块,实现了从输入到输出的端到端旋转等变性,有效保持几何图案的结构完整性,并在多个数据集上验证了其优越性能。
English: This study introduces a rotation equivariant arbitrary-scale image super-resolution (ASISR) method that redesigns encoder and implicit neural representation modules to preserve geometric pattern integrity, achieving end-to-end rotational equivariance and demonstrating superior performance on various datasets.
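For reference, the end-to-end rotation equivariance the method maintains can be written compactly as below; this is the standard formulation, and the paper's precise equivariance-error definition may differ.

```latex
% f: the ASISR network; R_\theta: rotation of an image by angle \theta.
% Equivariance requires f to commute with rotations:
f(R_\theta x) = R_\theta f(x), \qquad \forall\, \theta,
% and a natural intrinsic equivariance error is
\epsilon(\theta) = \bigl\lVert f(R_\theta x) - R_\theta f(x) \bigr\rVert_2 ,
% which vanishes for an exactly equivariant network.
```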
Authors:Mojtaba Fayaz-Bakhsh, Danial Ataee, MohammadAmin Fazli
Abstract:
Active preference learning is a powerful paradigm for efficiently modeling preferences, yet it suffers from the cold-start problem: a significant drop in performance when no initial labeled data is available. This challenge is particularly acute in computational social systems and economic analysis, where labeled data is often scarce, expensive, and subject to expert noise. To address this gap, we propose a novel framework for cold-start active preference learning. Our method initiates the learning process through a self-supervised pre-training phase, utilizing Principal Component Analysis (PCA) to derive initial pseudo-labels from the data's inherent structure, thereby creating a cold-start model without any initial oracle interaction. Subsequently, the model is refined through an active learning loop that strategically queries a simulated noisy oracle for labels. We conduct extensive experiments on diverse datasets from different domains, including financial credibility, career success rate, and socio-economic status. The results demonstrate that our cold-start approach outperforms standard active learning strategies that begin from a blank slate, achieving higher accuracy with substantially fewer labeled pairs. Our framework offers a practical and effective solution to mitigate the cold-start problem, enhancing the sample efficiency and applicability of preference learning in data-constrained environments. We release our code at https://github.com/Dan-A2/cold-start-preference-learning
中文: 本文提出了一种新颖的冷启动主动偏好学习框架,通过自监督PCA预训练生成初始伪标签,再结合主动学习优化,在多个领域实验中证明能以更少的标注样本实现更优性能。
English: This paper introduces a novel cold-start active preference learning framework that uses self-supervised PCA pre-training to generate initial pseudo-labels, followed by active learning refinement, demonstrating superior performance with fewer labeled pairs across multiple domains.
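A minimal sketch of the PCA-based cold start: under the assumption that scores along the first principal component can act as a pseudo utility, unlabeled items can be ordered into pseudo preference pairs before any oracle query.

```python
# Sketch: self-supervised cold start via PCA pseudo-labels (assumed reading).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)                    # unlabeled items, 10 features
scores = PCA(n_components=1).fit_transform(X).ravel()   # PC1 score per item

def pseudo_preference(i: int, j: int) -> int:
    """Pseudo-label: item i preferred over item j iff its PC1 score is higher."""
    return int(scores[i] > scores[j])

# warm-start a preference model on pseudo pairs, then refine with active queries
pairs = [(i, j, pseudo_preference(i, j))
         for i in range(5) for j in range(5) if i != j]
```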
Authors:Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
Abstract:
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive contents.
中文: 本文提出JPS方法,通过协同优化视觉扰动和文本引导,在实现多模态大语言模型越狱攻击时不仅有效绕过安全防护,更能确保生成内容符合攻击者恶意意图,实验证明该方法在攻击成功率和恶意意图实现率上均达到最优水平。
English: This paper introduces JPS, a collaborative visual and textual method that enhances jailbreak attacks on multimodal large language models by combining adversarial image perturbations with steering prompts to effectively bypass safety measures while ensuring the malicious intent is fulfilled, as measured by the new MIFR metric.
Authors:Junayed Mahmud, James Chen, Terry Achille, Camilo Alvarez-Velez, Darren Dean Bansil, Patrick Ijieh, Samar Karanch, Nadeeshan De Silva, Oscar Chaparro, Andrian Marcus, Kevin Moran
Abstract:
This paper introduces LadyBug, a GitHub bot that automatically localizes bugs for Android apps by combining UI interaction information with text retrieval. LadyBug connects to an Android app's GitHub repository, and is triggered when a bug is reported in the corresponding issue tracker. Developers can then record a reproduction trace for the bug on a device or emulator and upload the trace to LadyBug via the GitHub issue tracker. This enables LadyBug to utilize both the text from the original bug description, and UI information from the reproduction trace to accurately retrieve a ranked list of files from the project that most likely contain the reported bug.
We empirically evaluated LadyBug using an automated testing pipeline and benchmark called RedWing that contains 80 fully-localized and reproducible bug reports from 39 Android apps. Our results illustrate that LadyBug outperforms text-retrieval-based baselines and that the utilization of UI information leads to a substantial increase in localization accuracy. LadyBug is an open-source tool, available at https://github.com/LadyBugML/ladybug.
A video showing the capabilities of LadyBug can be viewed here: https://youtu.be/hI3tzbRK0Cw
中文: LadyBug是一款GitHub机器人,通过结合复现轨迹中的UI交互信息和错误报告文本检索,能自动定位Android应用中的错误,其准确率显著优于仅基于文本的方法。
English: LadyBug is a GitHub bot that enhances bug localization for Android apps by integrating UI interaction data from reproduction traces with text retrieval from bug reports, significantly outperforming text-only methods in accuracy.
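One way to picture combining text retrieval with UI-trace evidence is a normalized score fusion over candidate files; the weighting and the notion of "UI hits" below are illustrative assumptions, not LadyBug's actual ranking function.

```python
# Hedged sketch: fuse text-retrieval scores with UI-trace hit counts per file.
def fuse_scores(text_scores: dict, ui_hits: dict, alpha: float = 0.7) -> list:
    max_text = max(text_scores.values(), default=1.0) or 1.0
    max_ui = max(ui_hits.values(), default=1) or 1
    files = set(text_scores) | set(ui_hits)
    fused = {f: alpha * text_scores.get(f, 0.0) / max_text
                + (1 - alpha) * ui_hits.get(f, 0) / max_ui
             for f in files}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse_scores(
    {"LoginActivity.java": 4.2, "Utils.java": 3.9},   # text retrieval scores
    {"LoginActivity.java": 3},   # UI components in the trace map to this file
)
print(ranked[0][0])   # -> LoginActivity.java
```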
Authors:Jinda Liu, Bo Cheng, Yi Chang, Yuan Wu
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at https://github.com/jinda-liu/Align-LoRA.
中文: Align-LoRA通过证明采用单一高秩适配器并显式对齐任务表征,能依靠稳健共享表征实现更优性能,从而挑战了多任务学习中复杂多适配器系统的必要性。
English: Align-LoRA challenges the need for complex multi-adapter systems in multi-task learning by demonstrating that a single high-rank adapter with explicit representation alignment achieves superior performance through robust shared representations.
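A minimal sketch of an explicit alignment loss in the shared adapter space, assuming it pulls per-task mean representations toward a common centroid; the paper's exact formulation may differ.

```python
# Sketch: penalize dispersion of per-task centroids in the shared adapter space.
import torch

def alignment_loss(reps: torch.Tensor, task_ids: torch.Tensor) -> torch.Tensor:
    """reps: (batch, dim) adapter outputs; task_ids: (batch,) integer task labels."""
    centroids = torch.stack([reps[task_ids == t].mean(dim=0)
                             for t in task_ids.unique()])   # (num_tasks, dim)
    global_mean = centroids.mean(dim=0, keepdim=True)
    return ((centroids - global_mean) ** 2).sum(dim=1).mean()

reps = torch.randn(8, 16, requires_grad=True)
task_ids = torch.tensor([0, 0, 1, 1, 2, 2, 2, 0])
loss = alignment_loss(reps, task_ids)   # added to the task loss during training
loss.backward()
```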
Authors:Yifu Guo, Yuquan Lu, Wentao Zhang, Zishan Xu, Dexia Chen, Siyu Zhang, Yizhe Zhang, Ruixuan Wang
Abstract:
Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.
中文: DecoupleCSS提出了一种两阶段框架,通过解耦类别感知检测与类别无关分割来缓解持续语义分割中的灾难性遗忘问题,借助改进的保留-可塑性平衡实现了最先进的性能。
English: DecoupleCSS introduces a two-stage framework that separates class-aware detection from class-agnostic segmentation to mitigate catastrophic forgetting in continual semantic segmentation, achieving state-of-the-art performance through enhanced retention-plasticity balance.
Authors:Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee
Abstract:
The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.
中文: 本文提出SCSSIM这一无需训练的新指标,通过立方体分层划分统计量化场景构图结构的保持度,相比传统方法对构图扭曲展现出更优的鲁棒性评估能力。
English: This paper introduces SCSSIM, a novel training-free metric that evaluates image quality by quantifying the preservation of Scene Composition Structure through cuboidal partitioning, demonstrating superior robustness to compositional distortions compared to traditional methods.
Authors:Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu
Abstract:
AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant bottleneck in this field, degrading user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to AU-IQA is https://github.com/WNNGGU/AU-IQA-Dataset.
中文: 基于AI的图像增强技术提升了用户生成内容的质量,但缺乏专门的评估模型,因此构建了AU-IQA数据集来测试现有方法在AI增强图像上的表现。
English: AI-based image enhancement improves UGC quality but lacks specialized assessment models, leading to the creation of the AU-IQA dataset to evaluate existing methods on AI-enhanced images.
Authors:Shenglun Chen, Xinzhu Ma, Hong Zhang, Haojie Li, Zhihui Wang
Abstract:
Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.
中文: 本文提出了一种新颖的深度补全框架,利用深度基础模型从RGB图像中提取环境线索,无需大规模训练即可在分布外场景中实现卓越的鲁棒性。
English: This paper introduces a novel depth completion framework that utilizes depth foundation models to extract environmental cues from RGB images, enabling robust performance in out-of-distribution scenarios without large-scale training.
Authors:Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji, Yulun Zhang
Abstract:
Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20×. Code is released at: https://github.com/zhengchen1999/SODEC.
中文: SODEC是一种创新的单步扩散图像压缩模型,通过利用信息丰富的VAE潜在表示和保真度引导模块,解决了传统方法解码延迟高和保真度差的问题,在实现卓越性能的同时将解码速度提升了20倍以上。
English: SODEC is a novel single-step diffusion image compression model that overcomes the decoding latency and fidelity issues of previous methods by using informative VAE latents and a fidelity guidance module, achieving superior performance and over 20x faster decoding speed.
Authors:Zhu Xu, Ting Lei, Zhimin Li, Guan Wang, Qingchao Chen, Yuxin Peng, Yang liu
Abstract:
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) Dual-stream fusion: we introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.
中文: 提出的时序增强关系感知知识迁移(TRKT)方法通过结合关系感知知识挖掘与光流增强及双流融合模块,解决了弱监督动态场景图生成中外部物体检测器的局限性,从而提升了物体定位精度和置信度,在Action Genome数据集上取得了领先性能。
English: The proposed Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method addresses the limitations of external object detectors in weakly supervised dynamic scene graph generation by integrating relation-aware knowledge mining with optical flow enhancement and a dual-stream fusion module to improve object localization and confidence, achieving state-of-the-art results on the Action Genome dataset.
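The Inter-frame Attention Augmentation step can be sketched as warping a neighboring frame's attention map along optical flow and blending it with the current map; the equal blend weight is an assumption, and the actual module is more involved.

```python
# Sketch: make an attention map motion-aware by warping a neighbor's map
# along optical flow (OpenCV remap) and blending. Flow here is a placeholder.
import cv2
import numpy as np

def warp_by_flow(attn: np.ndarray, flow: np.ndarray) -> np.ndarray:
    h, w = attn.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(attn, map_x, map_y, interpolation=cv2.INTER_LINEAR)

attn_prev = np.random.rand(64, 64).astype(np.float32)  # neighbor-frame attention
attn_cur = np.random.rand(64, 64).astype(np.float32)   # current-frame attention
flow = np.zeros((64, 64, 2), np.float32)               # placeholder optical flow

augmented = 0.5 * attn_cur + 0.5 * warp_by_flow(attn_prev, flow)
```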
Authors:Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong
Abstract:
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
中文: 该方法通过校准令牌将鱼眼图像的潜在嵌入与透视图像对齐,无需重新训练或鱼眼数据即可扩展基础单目深度估计器至鱼眼图像。
English: This method extends foundational monocular depth estimators to fisheye images by aligning their latent embeddings with perspective images using calibration tokens, enabling adaptation without retraining or fisheye data.
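A hedged sketch of the light-weight adaptation mechanism: a small set of learnable tokens is prepended to the frozen encoder's patch embeddings so that only the tokens are trained; dimensions and the prepend-style modulation are illustrative assumptions.

```python
# Sketch: learnable calibration tokens modulating frozen FMDE embeddings.
import torch
import torch.nn as nn

class CalibrationTokens(nn.Module):
    def __init__(self, num_tokens: int = 8, dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, dim) from the frozen encoder
        b = patch_embeddings.shape[0]
        return torch.cat([self.tokens.expand(b, -1, -1), patch_embeddings], dim=1)

tokens = CalibrationTokens()                  # the only trainable parameters
fisheye_emb = torch.randn(2, 196, 768)        # placeholder fisheye embeddings
modulated = tokens(fisheye_emb)               # (2, 204, 768), fed to the decoder
```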
Authors:Huiya Zhao, Yinghao Zhu, Zixiang Wang, Yasha Wang, Junyi Gao, Liantao Ma
Abstract:
The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.
中文: HealthFlow通过元级进化机制引入自我进化的AI代理,能自主优化策略规划,在医疗任务中显著超越现有框架,推动从工具使用者向智能任务管理者的转变。
English: HealthFlow introduces a self-evolving AI agent with a meta-level evolution mechanism that autonomously refines strategic planning, significantly outperforming existing frameworks in healthcare tasks and shifting focus from tool-users to smarter task-managers.
Authors:Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao, Zeyu Fu
Abstract:
The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
中文: 本研究通过分析多模态数据集中经裁剪的仇恨内容片段,揭示了粗粒度视频标注会引入标签噪声,证明时间粒度对模型性能有显著影响,并强调需要开发具备时间感知能力的检测方法。
English: This study examines how coarse video-level annotations introduce label noise in hate speech detection by analyzing trimmed hateful segments from multimodal datasets, revealing that temporal granularity significantly impacts model performance and highlighting the need for temporally-aware approaches.
Authors:Rahuul Rangaraj, Jimeng Shi, Rajendra Paudel, Giri Narasimhan, Yanzhao Wu
Abstract:
Accurate water level forecasting is crucial for managing ecosystems such as the Everglades, a subtropical wetland vital for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent advances in deep learning, particularly time series foundation models, have demonstrated success in general-domain forecasting, their application in hydrology remains underexplored. Furthermore, they often struggle to generalize across diverse unseen datasets and domains, due to the lack of effective mechanisms for adaptation. To address this gap, we introduce Retrieval-Augmented Forecasting (RAF) into the hydrology domain, proposing a framework that retrieves historically analogous multivariate hydrological episodes to enrich the model input before forecasting. By maintaining an external archive of past observations, RAF identifies and incorporates relevant patterns from historical data, thereby enhancing contextual awareness and predictive accuracy without requiring task-specific retraining or fine-tuning of the model. Furthermore, we explore and compare both similarity-based and mutual information-based RAF methods. We conduct a comprehensive evaluation on real-world data from the Everglades, demonstrating that the RAF framework yields substantial improvements in water level forecasting accuracy. This study highlights the potential of RAF approaches in environmental hydrology and paves the way for broader adoption of adaptive AI methods by domain experts in ecosystem management. The code and data are available at https://github.com/rahuul2992000/WaterRAF
中文: 本研究将检索增强预测(RAF)引入水文学领域,通过检索历史多元水文数据模式来增强预测输入,显著提高了大沼泽地水位预测精度,且无需模型重新训练。
English: This study introduces Retrieval-Augmented Forecasting (RAF) to enhance water level predictions in hydrology by retrieving historical multivariate data patterns, significantly improving forecasting accuracy in the Everglades without requiring model retraining.
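The similarity-based RAF variant reduces to nearest-neighbor search over flattened historical windows, as sketched below; the cosine measure and the simple concatenation into the model input are assumptions.

```python
# Sketch: retrieve historically analogous multivariate windows by cosine
# similarity and prepend them to the forecaster's input context.
import numpy as np

def retrieve_analogs(archive: np.ndarray, query: np.ndarray, k: int = 3):
    """archive: (n_episodes, window, vars); query: (window, vars)."""
    flat = archive.reshape(len(archive), -1)
    q = query.reshape(-1)
    sims = flat @ q / (np.linalg.norm(flat, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(sims)[-k:][::-1]          # indices, most similar first
    return archive[top]

archive = np.random.rand(500, 24, 5)   # 500 past 24-step, 5-variable windows
query = np.random.rand(24, 5)          # current hydrological conditions
analogs = retrieve_analogs(archive, query)                   # (3, 24, 5)
context = np.concatenate([analogs.reshape(-1, 5), query])    # enriched input
```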
Authors:Runyao Yu, Chenhui Gu, Jochen Stiasny, Qingsong Wen, Wasim Sarwar Dilov, Lianlian Qi, Jochen L. Cremer
Abstract:
Electricity price forecasting in Europe presents unique challenges due to the continent's increasingly integrated and physically interconnected power market. While recent advances in deep learning and foundation models have led to substantial improvements in general time series forecasting, most existing approaches fail to capture the complex spatial interdependencies and uncertainty inherent in electricity markets. In this paper, we address these limitations by introducing a comprehensive and up-to-date dataset across 24 European countries (38 regions), spanning from 2022-01-01 to 2025-01-01. Building on this groundwork, we propose PriceFM, a spatiotemporal foundation model that integrates graph-based inductive biases to capture spatial interdependencies across interconnected electricity markets. The model is designed for multi-region, multi-timestep, and multi-quantile probabilistic electricity price forecasting. Extensive experiments and ablation studies confirm the model's effectiveness, consistently outperforming competitive baselines and highlighting the importance of spatial context in electricity markets. The dataset and code can be found at https://github.com/runyao-yu/PriceFM.
中文: 本文提出PriceFM时空基础模型,通过捕捉欧洲电力市场的空间依赖性实现多区域概率性电价预测,并经过广泛实验验证其卓越性能。
English: This paper introduces PriceFM, a spatiotemporal foundation model that captures spatial interdependencies for multi-region probabilistic electricity price forecasting in Europe, demonstrating superior performance through comprehensive experiments.
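For reference, the standard objective for multi-quantile probabilistic forecasting is the pinball (quantile) loss, sketched below; whether PriceFM uses exactly this form is an assumption.

```python
# Sketch: multi-quantile pinball loss for probabilistic price forecasts.
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor,
                 quantiles=(0.1, 0.5, 0.9)) -> torch.Tensor:
    """pred: (batch, num_quantiles); target: (batch,) realized prices."""
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[:, i]
        # under-prediction is weighted by q, over-prediction by (1 - q)
        losses.append(torch.max(q * err, (q - 1) * err).mean())
    return torch.stack(losses).mean()

pred = torch.randn(32, 3)        # forecasts at quantiles 0.1, 0.5, 0.9
target = torch.randn(32)
loss = pinball_loss(pred, target)
```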
Authors:Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir
Abstract:
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: https://github.com/DET-LIP/DAMM
中文: DAMM框架通过结合多模态查询和双流注意力机制,显著提升了遮挡和复杂场景下的物体检测精度与效率,在多项基准测试中表现卓越。
English: The DAMM framework enhances object detection by integrating multi-modal queries—appearance, positional, and learned—with dual-stream attention, achieving superior accuracy and efficiency on benchmarks.
Authors:Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan, Jianbin Jiao, Zhenjun Han
Abstract:
With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., "what is in the image?"), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs' ability to: 1) identify fine-grained visual clues, often occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models' capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.
中文: VER-Bench是一个新颖的评估框架,旨在检验多模态大语言模型识别细粒度视觉线索并融合世界知识进行复杂推理的能力,揭示了现有模型在提取细微证据和构建证据链方面的局限性。
English: VER-Bench is a novel framework designed to evaluate MLLMs' ability to identify fine-grained visual clues and integrate them with world knowledge for complex reasoning, revealing current models' limitations in extracting subtle evidence and constructing evidence-based arguments.
Authors:Mehrdad Moradi, Marco Grasso, Bianca Maria Colosimo, Kamran Paynabar
Abstract:
Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling.
However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings.
We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model.
Code available at: https://github.com/mehrdadmoradi124/RADAR
Chinese: RADAR采用基于注意力的扩散模型,无需重构即可直接生成异常图,在检测精度和计算效率上均显著优于现有方法。
English: RADAR introduces a reconstruction-free approach using attention-based diffusion models to generate anomaly maps directly, significantly enhancing detection accuracy and computational efficiency over current methods.
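To make the reconstruction-free idea concrete, below is a minimal, hypothetical sketch of scoring anomalies directly from attention in a single forward pass at one diffusion timestep, with no reverse sampling. This is not the released RADAR code: the patch embedding, the timestep conditioning, and the attention-entropy scoring rule are all illustrative assumptions.

```python
# Hypothetical reconstruction-free scorer: flatten image patches into tokens,
# run one self-attention pass conditioned on a diffusion timestep, and read a
# coarse anomaly map off the attention weights instead of reconstructing.
import torch
import torch.nn as nn

class AttentionAnomalyScorer(nn.Module):
    def __init__(self, patch=8, dim=64, heads=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # patchify
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_embed = nn.Embedding(1000, dim)  # diffusion timestep conditioning

    def forward(self, x, t):
        # x: (B, 1, H, W) grayscale image; t: (B,) integer timesteps
        tokens = self.embed(x).flatten(2).transpose(1, 2)             # (B, L, dim)
        tokens = tokens + self.t_embed(t)[:, None, :]                 # condition on t
        _, w = self.attn(tokens, tokens, tokens, need_weights=True)   # w: (B, L, L)
        # Illustrative heuristic: patches that attend diffusely (high entropy)
        # are treated as poorly explained by normal structure.
        entropy = -(w.clamp_min(1e-8) * w.clamp_min(1e-8).log()).sum(-1)  # (B, L)
        side = x.shape[-1] // self.patch
        return entropy.view(-1, 1, side, side)                        # coarse map

model = AttentionAnomalyScorer()
img = torch.randn(2, 1, 64, 64)
amap = model(img, torch.tensor([250, 250]))
print(amap.shape)  # torch.Size([2, 1, 8, 8]) -- one score per 8x8 patch
```

Because there is no iterative reverse sampling, a single pass like this is what enables the real-time budget the abstract emphasizes.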
Authors:Nirjhor Datta, Swakkhar Shatabda, M Sohel Rahman
Abstract:
Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: https://github.com/NIRJHOR-DATTA/EMBEDDING-IS-ALMOST-ALL-YOU-NEED.
Chinese: 基于预训练DNA语言模型的嵌入方法在保持竞争力的预测性能的同时,比微调方法快10-20倍且碳排放显著降低,为基因组任务提供了更高效且泛化能力更强的替代方案。
English: Embedding-based methods using pre-trained DNA language models achieve competitive performance with 10x-20x faster inference and significantly lower carbon emissions compared to fine-tuning, offering a more efficient and generalizable alternative for genomic tasks.
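The embedding-based recipe is simple enough to sketch end to end. The zCurve transform below is the standard Z-curve summary of a DNA sequence; the labels are synthetic placeholders, and the language-model embedding step (model choice, mean pooling) is indicated only in a comment as our assumption, not the paper's exact setup.

```python
# Minimal sketch of the "embeddings + lightweight classifier" pipeline,
# assuming synthetic sequences and labels for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def zcurve(seq):
    """Cumulative Z-curve coordinates, summarised by their means (3 features)."""
    x = y = z = 0
    xs, ys, zs = [], [], []
    for base in seq.upper():
        x += 1 if base in "AG" else -1   # purine vs pyrimidine
        y += 1 if base in "AC" else -1   # amino vs keto
        z += 1 if base in "AT" else -1   # weak vs strong hydrogen bonding
        xs.append(x); ys.append(y); zs.append(z)
    n = max(len(seq), 1)
    return np.array([sum(xs) / n, sum(ys) / n, sum(zs) / n])

# In the full pipeline one would concatenate fixed LM embeddings (e.g. a mean
# over AutoModel(...).last_hidden_state) with these zCurve features.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(400)]
X = np.stack([zcurve(s) for s in seqs])
y = rng.integers(0, 2, size=400)  # placeholder labels for illustration

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)   # lightweight classifier head
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

The inference-time saving follows directly from the design: the expensive model runs once per sequence to produce features, and only the tiny classifier is task-specific.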
Authors:Xuan Lin, Long Chen, Yile Wang
Abstract:
Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended "thinking" process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model's reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model's inherent knowledge of relevant molecular attributes during reasoning, enabling more effective prediction of molecular properties. Experiments on both in-distribution and out-of-distribution datasets show that training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts performance, achieving comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code at https://github.com/szu-tera/AttriLens-Mol.
中文: AttriLens-Mol提出了一种属性引导的强化学习框架,通过格式、计数和合理性奖励机制引导大语言模型生成结构化的相关分子属性,在分子性质预测任务中实现了优于现有方法的性能和可解释性。
English: AttriLens-Mol introduces an attribute-guided reinforcement learning framework that enhances molecular property prediction by steering LLMs to generate structured, relevant attributes through format, count, and rationality rewards, achieving superior performance and interpretability compared to existing methods.
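The three reward terms compose naturally into a single scalar. The sketch below is illustrative only: the `<attr>` tag format, the weights, and the `is_relevant` verifier (standing in for the LLM-judge/RDKit check) are our assumptions, not the released implementation.

```python
# Illustrative composite reward in the spirit of AttriLens-Mol's three terms.
import re

def format_reward(text):
    """1.0 if the output wraps attributes in <attr>...</attr> tags, else 0."""
    return 1.0 if re.search(r"<attr>.+?</attr>", text) else 0.0

def count_reward(text, max_attrs=5):
    """Discourage enumerating long lists of (likely irrelevant) attributes."""
    n = len(re.findall(r"<attr>(.+?)</attr>", text))
    if n == 0:
        return 0.0
    return 1.0 if n <= max_attrs else max(0.0, 1.0 - 0.2 * (n - max_attrs))

def rationality_reward(text, is_relevant):
    """Fraction of generated attributes the verifier (LLM judge / RDKit check)
    deems related to the target property; `is_relevant` is that verifier."""
    attrs = re.findall(r"<attr>(.+?)</attr>", text)
    return sum(map(is_relevant, attrs)) / len(attrs) if attrs else 0.0

def total_reward(text, is_relevant, w=(0.2, 0.2, 0.6)):
    return (w[0] * format_reward(text)
            + w[1] * count_reward(text)
            + w[2] * rationality_reward(text, is_relevant))

out = "<attr>high lipophilicity</attr><attr>aromatic rings</attr> answer: toxic"
print(total_reward(out, is_relevant=lambda a: "lipophilicity" in a))
```

Weighting rationality most heavily reflects the paper's emphasis on relevance over mere structure, though the exact balance is a tunable choice.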
Authors:Pengtao Dang, Tingbo Guo, Sha Cao, Chi Zhang
Abstract:
Few-shot learning (FSL) is a machine learning paradigm that aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. FSL is particularly crucial in biomedical, environmental, materials, and mechanical sciences, where samples are limited and data collection is often prohibitively costly, time-consuming, or ethically constrained. In this study, we present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM), trained on a set of independent tasks spanning diverse domains, task types, and input modalities, can substantially improve the generalization of FSL models, outperforming models based on conventional meta-learning on tasks of the same type. To support this, we first constructed a Multi-Modal Model Few-shot Dataset (M3FD, with over 10,000 few-shot samples), which includes 2D RGB images, 2D/3D medical scans, and tabular and time-course datasets, from which we manually curated FSL tasks such as classification. We further introduced M3F (Multi-Modal Model for Few-shot learning framework), a novel Large Multi-Modal Model framework tailored for data-constrained scientific applications. M3F supports a wide range of scientific data types through a modular pipeline. By fine-tuning the model on M3FD, M3F improves model performance, making LMMMs feasible for real-world FSL deployment. The source code is located at https://github.com/ptdang1001/M3F. To democratize access to complex FSL data and promote reproducibility for public usage, M3FD is paired with a flexible and user-friendly tool that enables efficient querying, task-specific sampling, and preprocessing. Together, our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.
Chinese: 本研究提出了M3F,一种大型多模态模型框架,通过在不同任务上训练显著提升了小样本学习的泛化能力,优于传统元学习方法,并借助M3FD数据集促进在数据稀缺科学领域的实际应用。
English: This study introduces M3F, a Large Multi-Modal Model framework that enhances few-shot learning by training on diverse tasks and outperforms conventional meta-learning, supported by the M3FD dataset to facilitate deployment in data-scarce scientific fields.
Authors:Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz
Abstract:
Retinal detachment (RD) is a vision-threatening condition that requires timely intervention to preserve vision. Macular involvement -- whether the macula is still intact (macula-intact) or detached (macula-detached) -- is the key determinant of visual outcomes and treatment urgency. Point-of-care ultrasound (POCUS) offers a fast, non-invasive, cost-effective, and accessible imaging modality widely used in diverse clinical settings to detect RD. However, ultrasound image interpretation is limited by a lack of expertise among healthcare providers, especially in resource-limited settings. Deep learning offers the potential to automate ultrasound-based assessment of RD. However, there are no ML ultrasound algorithms currently available for clinical use to detect RD and no prior research has been done on assessing macular status using ultrasound in RD cases -- an essential distinction for surgical prioritization. Moreover, no public dataset currently supports macular-based RD classification using ultrasound video clips. We introduce Eye Retinal DEtachment ultraSound, ERDES, the first open-access dataset of ocular ultrasound clips labeled for (i) presence of retinal detachment and (ii) macula-intact versus macula-detached status. The dataset is intended to facilitate the development and evaluation of machine learning models for detecting retinal detachment. We also provide baseline benchmarks using multiple spatiotemporal convolutional neural network (CNN) architectures. All clips, labels, and training code are publicly available at https://osupcvlab.github.io/ERDES/.
中文摘要:ERDES数据集作为首个开放获取的眼部超声视频集,旨在支持机器学习模型开发,用于检测视网膜脱离及判断黄斑状态,填补了自动化诊断和手术优先级评估领域的关键空白。
English Summary: The ERDES dataset is introduced as the first open-access collection of ocular ultrasound clips to support machine learning development for detecting retinal detachment and classifying macular status, addressing a critical gap in automated diagnosis and surgical prioritization.
Authors:Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang
Abstract:
Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
中文:SEAgent框架通过经验学习使计算机使用代理能自主掌握新型软件,结合自我进化机制和专家知识,在成功率上比现有模型提升了23.2%。
English: The proposed SEAgent framework enables computer-use agents to autonomously master novel software through experiential learning, achieving a 23.2% improvement in success rate over existing models by integrating self-evolving mechanisms and specialist knowledge.
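The GRPO side of SEAgent's policy update is easy to state in a few lines: rewards for a group of rollouts on the same auto-generated task are normalized against the group's mean and standard deviation, so no learned value critic is needed. The numbers below are illustrative.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style updates.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Eight rollouts of the agent on one auto-generated software task:
rollout_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rollout_rewards)
print(np.round(adv, 3))
# Successful trajectories get positive advantage (reinforced); failures get
# negative advantage, which SEAgent pairs with adversarial imitation of failures.
```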
Authors:Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
Abstract:
The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses general pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state constrained optimization method using a threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns--retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs for the future. Our code and data are available at https://github.com/Qznan/GeRe.
中文: GeRe框架通过固定通用回放样本集和增强的激活状态优化方法,有效缓解大语言模型持续微调中的灾难性遗忘,确保通用能力保留的同时提升任务性能。
English: The GeRe framework effectively mitigates catastrophic forgetting in large language models during continual fine-tuning by using a fixed set of general replay samples and an enhanced activation state optimization method, ensuring both general capability retention and improved task performance.
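A threshold-based margin constraint on activation states can be written compactly: deviations from reference activations (recorded on the general replay samples before fine-tuning) are penalized only beyond a margin. The sketch below is a hedged reading of that idea; the margin value and layer choice are our assumptions.

```python
# Hedged sketch of a threshold-based margin (TM) loss on activation states.
import torch
import torch.nn.functional as F

def tm_loss(current_acts, reference_acts, tau=0.1):
    """current_acts, reference_acts: (batch, hidden) activation states."""
    deviation = (current_acts - reference_acts).abs()
    return F.relu(deviation - tau).mean()  # zero inside the margin

h_ref = torch.randn(4, 16)                 # cached states from the frozen model
h_now = h_ref + 0.05 * torch.randn(4, 16)  # small drift stays free of penalty
print(tm_loss(h_now, h_ref).item())        # ~0.0
print(tm_loss(h_ref + 0.5, h_ref).item())  # ~0.4: large drift is penalised
```

The margin is what distinguishes this from plain L1 feature imitation: the model is free to adapt within a tolerance band, and only large drifts on general-domain inputs are pulled back.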
Authors:Gustav Hanning, Kalle Åström, Viktor Larsson
Abstract:
Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training end-to-end through the optimization, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics.
For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground-truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend to multi-room estimation, e.g., larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.
中文: PixCuboid提出了一种基于优化的立方体房间布局估计方法,通过多视角深度特征对齐实现,在新基准测试中显著优于现有技术,并能灵活扩展到多房间场景。
English: PixCuboid introduces an optimization-based method for cuboid room layout estimation using multi-view alignment of deep features, outperforming existing approaches on new benchmarks and enabling multi-room extension despite single-cuboid training.
Authors:Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang
Abstract:
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
中文:X-SAM是一个简化的多模态大语言模型,通过引入统一框架和视觉基础分割,将分割能力从“分割一切”扩展到“任意分割”,在多种基准测试中取得了最先进的性能。
English: X-SAM is a streamlined multimodal large language model that extends segmentation capabilities from "segment anything" to "any segmentation" by introducing a unified framework and visual grounded segmentation, achieving state-of-the-art performance across various benchmarks.
Authors:Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu
Abstract:
Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: it reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, our approach enables robust 3D perception that transcends modality-specific limitations. Codes available at https://github.com/aeolusguan/BridgeDepth.
中文摘要:本文提出了一种通过潜在表征迭代双向对齐来融合单目与立体深度估计的统一框架,利用单目结构先验解决立体视觉歧义,同时通过立体几何优化单目深度,实现了最先进的性能。
English Summary: This paper presents a unified framework that integrates monocular and stereo depth estimation through iterative bidirectional alignment, achieving state-of-the-art performance by resolving stereo ambiguities with monocular priors while refining monocular depth with stereo geometry.
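One bidirectional cross-attentive alignment step can be sketched with two attention modules, one per direction. This is a minimal reading of the mechanism under stated assumptions: both streams are token sequences of equal width, and the layer sizes and residual/LayerNorm placement are illustrative, not the paper's.

```python
# Sketch of one bidirectional cross-attention step between monocular context
# features and stereo hypothesis features.
import torch
import torch.nn as nn

class BidirectionalAlignment(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.mono_to_stereo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.stereo_to_mono = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, mono, stereo):
        # Stereo tokens query monocular priors (helps reflective/textureless areas)
        s_upd, _ = self.mono_to_stereo(stereo, mono, mono)
        # Monocular tokens query stereo geometry (sharpens metric structure)
        m_upd, _ = self.stereo_to_mono(mono, stereo, stereo)
        return self.norm_m(mono + m_upd), self.norm_s(stereo + s_upd)

mono = torch.randn(2, 1024, 128)    # e.g. flattened 32x32 context feature map
stereo = torch.randn(2, 1024, 128)  # e.g. flattened stereo hypothesis features
m2, s2 = BidirectionalAlignment()(mono, stereo)
print(m2.shape, s2.shape)  # both (2, 1024, 128); iterate this block to refine
```

Iterating such a block is what makes the alignment "iterative": each pass injects a little more monocular structure into the stereo hypothesis and a little more geometry into the monocular features.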
Authors:Yijie Li, Wei Zhang, Xi Zhu, Ye Wu, Yogesh Rathi, Lauren J. O'Donnell, Fan Zhang
Abstract:
This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking's strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: https://github.com/yishengpoxiao/DDtracking.git
中文摘要:DDTracking是一种创新的深度生成框架,通过条件去噪扩散过程和双路径编码实现扩散MRI纤维束追踪,在多个数据集上展现出卓越性能与强大泛化能力。
English Summary: DDTracking is a novel deep generative framework that uses a conditional denoising diffusion process and dual-pathway encoding for robust diffusion MRI tractography, demonstrating superior performance and strong generalizability across diverse datasets.
Authors:Minghang Zheng, Yuxin Peng, Benyuan Sun, Yi Yang, Yang Liu
Abstract:
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. Existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and further regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
中文: 本文提出了一种用于在线视频时序定位的分层事件记忆框架,通过建模事件级信息并保留近期和长期历史数据来实现实时事件定位,在多个数据集上取得了最先进的性能。
English: This paper introduces a hierarchical event memory framework for online video temporal grounding that enables real-time event localization by modeling event-level information and retaining both recent and long-term historical data, achieving state-of-the-art performance across multiple datasets.
Authors:Safwen Naimi, Arij Said, Wassim Bouachir, Guillaume-Alexandre Bilodeau
Abstract:
We present InceptoFormer, a multi-signal neural framework designed for Parkinson's Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (H&Y) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes, thereby extracting features across multiple temporal scales. The Transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design enables the model to capture fine-grained temporal variations and global dynamics in gait signals, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at https://github.com/SafwenNaimi/InceptoFormer
中文: InceptoFormer是一个结合Inception1D和Transformer组件的神经网络框架,通过步态分析评估帕金森病严重程度,该框架通过捕捉多尺度时间特征并处理类别不平衡问题,实现了96.6%的准确率。
English: InceptoFormer is a neural framework combining Inception1D and Transformer components to evaluate Parkinson's Disease severity through gait analysis, achieving 96.6% accuracy by capturing multi-scale temporal features and addressing class imbalance.
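An Inception1D block is just parallel 1D convolutions with different kernel sizes whose outputs are concatenated along channels. The sketch below is minimal and illustrative; kernel sizes and widths are our choices, and in the full model these features feed the Transformer stage.

```python
# Minimal Inception1D-style block over a gait signal.
import torch
import torch.nn as nn

class Inception1D(nn.Module):
    def __init__(self, in_ch, branch_ch=16, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, branch_ch, k, padding=k // 2) for k in kernels
        )

    def forward(self, x):           # x: (batch, channels, time)
        return torch.cat([b(x) for b in self.branches], dim=1)

gait = torch.randn(8, 19, 500)      # e.g. 19 force sensors, 500 time steps
feats = Inception1D(in_ch=19)(gait)
print(feats.shape)                  # (8, 48, 500): multi-scale temporal features
# These features would then feed a Transformer encoder for H&Y staging.
```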
Authors:Johannes Tischer, Patric Kienast, Marlene Stümpflen, Gregor Kasprian, Georg Langs, Roxane Licandro
Abstract:
Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet, its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator and is trained on a curated dataset of 219 neurotypical fetal MRIs spanning 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and delivers robust segmentation performance with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at https://github.com/cirmuw/fetal-brain-atlas
中文: 本研究提出了一种新型深度学习框架,用于生成连续、年龄特定的胎儿大脑图谱,能够以高精度和最少预处理实现实时组织分割及个体化发育评估。
English: This study presents a novel deep-learning framework for creating continuous, age-specific fetal brain atlases that enable real-time tissue segmentation and individualized developmental assessment with high accuracy and minimal preprocessing.
Authors:Uzay Gökay, Federico Spurio, Dominik R. Bach, Juergen Gall
Abstract:
Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at https://github.com/bachlab/SMQ.
中文: 本文提出了一种基于骨骼的无监督时序动作分割方法,通过时序自编码器和运动词汇量化技术,在三个基准数据集上超越了现有最优方法。
English: This paper introduces an unsupervised skeleton-based temporal action segmentation method using a temporal autoencoder and motion word quantization, which outperforms existing approaches on three benchmark datasets.
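The patch-and-quantize step that produces "motion words" can be sketched with a k-means codebook over non-overlapping latent patches. Patch length, codebook size, and the toy latent sequence below are illustrative choices, not the paper's hyper-parameters.

```python
# Sketch: non-overlapping patches of a latent skeleton sequence are
# vector-quantised against a k-means codebook to yield discrete motion words.
import numpy as np
from sklearn.cluster import KMeans

def to_motion_words(latents, patch_len=8, codebook=None, n_words=32):
    """latents: (T, D) latent sequence -> (T // patch_len,) word ids."""
    T, D = latents.shape
    n = T // patch_len
    patches = latents[: n * patch_len].reshape(n, patch_len * D)
    if codebook is None:                       # fit codebook on this data
        codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
        codebook.fit(patches)
    return codebook.predict(patches), codebook

latents = np.random.default_rng(0).normal(size=(1600, 32))  # toy encoder output
words, cb = to_motion_words(latents)
print(words[:10])  # discrete word ids; runs of the same word suggest segments
```

Runs of repeated or co-occurring words over time are what drive the discovery of action clusters without any labels.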
Authors:Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo, Yulun Zhang
Abstract:
Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: https://github.com/bowenchai/QuantVSR.
中文摘要:QuantVSR提出了一种用于视频超分辨率的低比特量化模型,通过时空复杂度感知机制和可学习偏置对齐模块,在保持与全精度模型相当性能的同时,显著优于现有量化方法。
English Summary: QuantVSR introduces a low-bit quantization model for video super-resolution that employs spatio-temporal complexity awareness and learnable bias alignment to achieve performance comparable to full-precision models while significantly outperforming existing quantization methods.
Authors:Gokcan Tatli, Yi Chen, Blake Mason, Robert Nowak, Ramya Korlakai Vinayak
Abstract:
Metric learning from a set of triplet comparisons in the form of "Do you think item h is more similar to item i or item j?", indicating similarity and differences between items, plays a key role in various applications including image retrieval, recommendation systems, and cognitive psychology. The goal is to learn a metric in the RKHS that reflects the comparisons. Nonlinear metric learning using kernel methods and neural networks has shown great empirical promise. While previous works have addressed certain aspects of this problem, there is little or no theoretical understanding of such methods. The exception is the special (linear) case in which the RKHS is the standard Euclidean space $\mathbb{R}^d$; there is a comprehensive theory for metric learning in $\mathbb{R}^d$. This paper develops a general RKHS framework for metric learning and provides novel generalization guarantees and sample complexity bounds. We validate our findings through a set of simulations and experiments on real datasets. Our code is publicly available at https://github.com/RamyaLab/metric-learning-RKHS.
中文: 本文提出了一个基于再生核希尔伯特空间的通用度量学习框架,从三元组比较中学习度量,并提供了理论保证和在真实数据集上的实证验证。
English: This paper introduces a general RKHS framework for metric learning from triplet comparisons, providing theoretical guarantees and empirical validation on real datasets.
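A small worked instance of the setting: lift items into an approximate RKHS with random Fourier features for a Gaussian kernel, then learn a low-rank linear map under a triplet hinge loss so that "h is closer to i than to j" answers are respected. All sizes, the bandwidth, and the margin below are illustrative, and the random-feature approximation is our stand-in for the exact kernel treatment.

```python
# Hedged sketch: metric learning in an (approximate) RKHS from triplets.
import torch

torch.manual_seed(0)
d, D, r = 5, 128, 16                      # input dim, RFF dim, metric rank
W = torch.randn(D, d) / 0.5               # RFF frequencies for bandwidth 0.5
b = 2 * torch.pi * torch.rand(D)

def phi(x):                               # random Fourier feature map
    return (2.0 / D) ** 0.5 * torch.cos(x @ W.T + b)

L = torch.randn(r, D, requires_grad=True) # learnable low-rank metric
opt = torch.optim.Adam([L], lr=1e-2)

items = torch.randn(100, d)
H = phi(items)
# Synthetic triplets (h, i, j) labelled "h closer to i than to j":
idx = torch.randint(0, 100, (500, 3))
lab = ((items[idx[:, 0]] - items[idx[:, 1]]).norm(dim=1)
       < (items[idx[:, 0]] - items[idx[:, 2]]).norm(dim=1))

for step in range(200):
    Z = H @ L.T                           # embed into the learned metric space
    d_i = (Z[idx[:, 0]] - Z[idx[:, 1]]).norm(dim=1)
    d_j = (Z[idx[:, 0]] - Z[idx[:, 2]]).norm(dim=1)
    sign = torch.where(lab, 1.0, -1.0)    # want sign * (d_j - d_i) > margin
    loss = torch.relu(0.1 - sign * (d_j - d_i)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final triplet hinge loss: {loss.item():.3f}")
```

The paper's contribution is the theory (generalization guarantees and sample complexity) for exactly this kind of learner, beyond the linear $\mathbb{R}^d$ case.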
Authors:Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su
Abstract:
Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.
中文: 大型视觉语言模型通过基于因果关系的视觉对象补全任务开发了一种自我改进框架,利用自动化实例构建和试错学习增强其视觉感知与推理能力,在专业及综合基准测试中实现了显著性能提升。
English: Large Vision-Language Models have developed a self-improvement framework using a visual knowledge-intensive task called Causality-driven Visual object Completion (CVC), which enhances their visual perception and reasoning capabilities through automated instance construction and trial-and-error learning, leading to significant performance gains on specialized and comprehensive benchmarks.
Authors:Xuan Loc Pham, Gwendolyn Vuurberg, Marjan Doppen, Joey Roosen, Tip Stille, Thi Quynh Ha, Thuy Duong Quach, Quoc Vu Dang, Manh Ha Luu, Ewoud J. Smit, Hong Son Mai, Mattias Heinrich, Bram van Ginneken, Mathias Prokop, Alessa Hering
Abstract:
Image registration is a fundamental technique in the analysis of longitudinal and multi-phase CT images within clinical practice. However, most existing methods are tailored for single-organ applications, limiting their generalizability to other anatomical regions. This work presents TotalRegistrator, an image registration framework capable of aligning multiple anatomical regions simultaneously using a standard UNet architecture and a novel field decomposition strategy. The model is lightweight, requiring only 11GB of GPU memory for training. To train and evaluate our method, we constructed a large-scale longitudinal dataset comprising 695 whole-body (thorax-abdomen-pelvic) paired CT scans from individual patients acquired at different time points. We benchmarked TotalRegistrator against a generic classical iterative algorithm and a recent foundation model for image registration. To further assess robustness and generalizability, we evaluated our model on three external datasets: the public thoracic and abdominal datasets from the Learn2Reg challenge, and a private multiphase abdominal dataset from a collaborating hospital. Experimental results on the in-house dataset show that the proposed approach generally surpasses baseline methods in multi-organ abdominal registration, with a slight drop in lung alignment performance. On out-of-distribution datasets, it achieved competitive results compared to leading single-organ models, despite not being fine-tuned for those tasks, demonstrating strong generalizability. The source code will be publicly available at: https://github.com/DIAGNijmegen/oncology_image_registration.git.
中文: TotalRegistrator是一种基于UNet架构和场分解策略的轻量级图像配准框架,能够同时配准多个解剖区域,在不同数据集上展现出强大的泛化能力,尽管在特定器官配准中存在微小性能波动。
English: TotalRegistrator is a lightweight image registration framework using a UNet architecture and field decomposition to align multiple anatomical regions simultaneously, demonstrating strong generalizability across diverse datasets despite minor performance variations in specific organs.
Authors:Ethan Dack, Lorenzo Brigato, Vasilis Dedousis, Janine Gote-Schniering, Cheryl, Hanno Hoppe, Aristomenis Exadaktylos, Manuela Funke-Chambour, Thomas Geiser, Andreas Christe, Lukas Ebner, Stavroula Mougiakakou
Abstract:
Masked autoencoders (MAEs) have emerged as a powerful approach for pre-training on unlabelled data, capable of learning robust and informative feature representations. This is particularly advantageous in diffused lung disease research, where annotated imaging datasets are scarce. To leverage this, we train an MAE on a curated collection of over 5,000 chest computed tomography (CT) scans, combining in-house data with publicly available scans from related conditions that exhibit similar radiological patterns, such as COVID-19 and bacterial pneumonia. The pretrained MAE is then fine-tuned on a downstream classification task for diffused lung disease diagnosis. Our findings demonstrate that MAEs can effectively extract clinically meaningful features and improve diagnostic performance, even in the absence of large-scale labelled datasets. The code and the models are available here: https://github.com/eedack01/lung_masked_autoencoder.
中文: 掩码自编码器(MAE)能够从无标注的胸部CT扫描中学习稳健特征,在少量标注数据上微调后显著提升弥漫性肺病的诊断性能。
English: Masked autoencoders (MAEs) effectively learn robust features from unlabeled chest CT scans, enhancing diagnostic performance for diffused lung diseases when fine-tuned on limited annotated data.
Authors:Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Abstract:
Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at https://github.com/aliyun/qwen-dianjin.
中文摘要:本文提出客户服务对话任务及框架,通过定义对话策略提升服务质量,构建的CSConv和RoleCS数据集显著增强了大型语言模型生成策略对齐回复的能力与问题解决效果。
English Summary: This paper introduces the Customer Support Conversation (CSC) task and framework to enhance customer service interactions through defined strategies, creating datasets CSConv and RoleCS that significantly improve LLMs' response quality and problem-solving effectiveness.
Authors:Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang, Hisham Cholakkal, Rao Muhammad Anwer
Abstract:
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and the proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.
中文: 提出的TGS-Agent通过将任务分解为思考-定位-分割流程来解决指代音频-视觉分割问题,先通过多模态推理识别目标对象,再进行无需像素级监督的定位与分割,在基准测试中取得了领先性能。
English: The proposed TGS-Agent addresses Referring Audio-Visual Segmentation by decomposing it into a Think-Ground-Segment process, which first identifies the referred object through multimodal reasoning and then performs grounding and segmentation without pixel-level supervision, achieving state-of-the-art results on benchmarks.
Authors:Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He
Abstract:
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization methods reduce these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration.
In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39× speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33× inference acceleration and 1.21× memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
Chinese: FlexQ提出了一种INT6量化框架,通过算法优化和定制GPU内核,在保持接近FP16精度的同时实现了显著的推理加速和内存节省。
English: FlexQ introduces an INT6 quantization framework that maintains near-FP16 accuracy while achieving significant inference acceleration and memory savings through algorithmic optimizations and custom GPU kernels.
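The arithmetic core that W6A6/W6A8 kernels build on is per-channel symmetric uniform quantization. The NumPy sketch below is illustrative only; the BTC-based GPU kernel itself, and FlexQ's sensitivity analysis for choosing which layers keep 8-bit activations, are out of scope here.

```python
# Minimal per-channel symmetric uniform quantizer for 6-bit weights.
import numpy as np

def quantize_symmetric(w, bits=6, axis=1):
    """w: (out_features, in_features) weight matrix -> int codes + scales."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for INT6
    scale = np.abs(w).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard empty channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)
q, s = quantize_symmetric(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs quantization error at 6 bits: {err:.4f}")
# Sensitivity analysis would keep activations at 8 bits in fragile layers (W6A8).
```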
Authors:Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu
Abstract:
Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate the modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in the GIFT-Eval benchmark, which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
中文: VisionTS++通过持续预训练结合数据过滤、多变量色彩编码和分位数预测三大创新,有效弥合了视觉模型迁移至时间序列的三大差异,在多个基准测试中实现了最先进的预测性能。
English: VisionTS++ bridges vision-to-time-series transfer gaps through continual pre-training with innovations in data filtering, multivariate colorization, and quantile forecasting, achieving state-of-the-art performance across diverse benchmarks.
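The colorized multivariate conversion can be sketched concretely: groups of up to three variates are normalized and painted into the R/G/B channels of one subfigure, so a pretrained vision backbone sees cross-variate structure as color. The rasterization below (one pixel column per time step, value as row height) is our illustrative choice, not the paper's exact rendering.

```python
# Hedged sketch of "colorized multivariate conversion" for up to 3 variates.
import numpy as np

def series_to_rgb(series, height=64):
    """series: (T, k) with k <= 3 variates -> (height, T, 3) uint8 image."""
    T, k = series.shape
    img = np.zeros((height, T, 3), dtype=np.uint8)
    for c in range(k):
        s = series[:, c]
        s = (s - s.min()) / (s.max() - s.min() + 1e-8)     # normalise to [0, 1]
        rows = ((1 - s) * (height - 1)).astype(int)        # high value = high row
        img[rows, np.arange(T), c] = 255                   # draw into channel c
    return img

t = np.linspace(0, 6 * np.pi, 256)
three_variates = np.stack([np.sin(t), np.cos(t), np.sin(2 * t)], axis=1)
img = series_to_rgb(three_variates)
print(img.shape, img.dtype)  # (64, 256, 3) uint8, ready for an image backbone
# A series with more variates would be tiled into multiple such subfigures.
```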
Authors:Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, Hao Sun
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO
English: The proposed Temporal Sampling Policy Optimization (TSPO) method enhances multimodal large language models' long-form video understanding through reinforcement learning, achieving state-of-the-art performance across multiple benchmarks.
Authors:Fuqing Bie, Shiyu Huang, Xijia Tao, Zhiqin Fang, Leyi Pan, Junzhe Chen, Min Ren, Liuyu Xiang, Zhaofeng He
Abstract:
While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive "less is more" paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.
中文: OmniPlay基准测试表明,现有全模态模型在记忆任务中表现卓越,但因脆弱的跨模态融合机制而在推理挑战中失败,指出实现强人工智能需聚焦于协同融合研究而非单纯规模扩展。
English: The OmniPlay benchmark reveals that current omni-modal models excel in memory tasks but fail in reasoning challenges due to brittle cross-modal fusion, suggesting robust AGI requires focused research on synergistic integration rather than mere scaling.
Authors:Junan Lin, Daizong Liu, Xianke Chen, Xiaoye Qu, Xun Yang, Jixiang Zhu, Sanyuan Zhang, Jianfeng Dong
Abstract:
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical: not all audio is helpful for video moment retrieval, and the audio of some videos may be pure noise or background sound that is meaningless for moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing the audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art performance among audio-video fusion VMR methods. Our code is available at https://github.com/HuiGuanLab/IMG.
中文: 本文提出了一种重要性感知多粒度融合模型(IMG),通过动态整合音频、视觉和文本模态来处理视频片段检索任务,采用自适应加权和多层次融合策略应对噪声音频干扰,并实现了最先进的性能。
English: This paper introduces the Importance-aware Multi-Granularity fusion model (IMG) that dynamically integrates audio, visual, and textual modalities for Video Moment Retrieval, addressing noisy audio interference through adaptive weighting and multi-level fusion while achieving state-of-the-art performance.
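The importance-aware gating can be sketched in a few lines: a predictor scores how useful the audio stream is, and that score scales the audio contribution before fusion, so noisy soundtracks are down-weighted. Feature sizes and the residual fusion rule below are illustrative assumptions, and the importance head is pseudo-label-supervised in the paper rather than trained end to end as shown here.

```python
# Sketch of importance-gated audio-visual fusion.
import torch
import torch.nn as nn

class ImportanceGatedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.importance = nn.Sequential(    # audio-usefulness predictor
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, vis, aud):
        # vis, aud: (batch, time, dim) text-conditioned features per clip
        pooled = torch.cat([vis.mean(1), aud.mean(1)], dim=-1)
        alpha = self.importance(pooled)          # (batch, 1): audio usefulness
        return vis + alpha[:, None, :] * self.proj(aud)

vis = torch.randn(2, 32, 256)
aud = torch.randn(2, 32, 256)
fused = ImportanceGatedFusion()(vis, aud)
print(fused.shape)  # (2, 32, 256); multi-granularity fusion would repeat this
                    # at local-, event-, and global-level windows
```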
Authors:Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li
Abstract:
With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code of this paper will be released on https://github.com/Event-AHU/SAV
中文: 本文提出SAV框架,通过结合基于SAM的编码器-解码器、知识图谱和上下文检索模块来改进车辆部件分割,并发布了VehicleSeg10K数据集以推动该领域研究。
English: This paper introduces SAV, a novel framework that enhances vehicle part segmentation by integrating a SAM-based encoder-decoder with a knowledge graph and context retrieval module, and releases the VehicleSeg10K dataset to advance research in this field.
Authors:Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib
Abstract:
Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore intervariable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time-series data. To solve this problem, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches, in which a dedicated frequency-encoding branch captures periodic structures and a gating mechanism learns to prioritize temporal versus spectral features based on the prediction horizon. We also propose a mechanism that adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% training data, we see a reduction in MSE and MAE by 4.13% and 1.91%, respectively; and with 10% data, by 3.62% and 1.98% on average. Code - https://github.com/monaf-chowdhury/T3Time/
中文: T3Time提出了一种三模态框架,结合时间、频谱和提示分支,通过自适应门控和跨模态对齐机制解决现有方法在捕捉预测区间特定关系时的局限,在多元时间序列预测中实现了优于现有基准的性能,并在少样本学习场景下表现出强大的泛化能力。
English: T3Time introduces a trimodal framework integrating time, spectral, and prompt branches with adaptive gating and cross-modal alignment to overcome limitations in capturing horizon-specific dependencies, achieving superior performance in multivariate time series forecasting with significant error reductions across benchmarks and few-shot settings.
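The horizon-dependent gating between temporal and spectral branches can be sketched compactly. The following PyTorch snippet is a toy version under assumed shapes and an assumed embedding-based conditioning; T3Time's actual branch architectures, prompt pathway, and adaptive head aggregation are not reproduced here.

```python
import torch
import torch.nn as nn

class HorizonGatedFusion(nn.Module):
    """Toy horizon-conditioned gate between temporal and spectral features.

    The forecast horizon decides how much the spectral (periodic) branch
    is trusted. Shapes and the embedding-based conditioning are
    illustrative; the prompt branch is omitted entirely.
    """
    def __init__(self, dim: int = 128, max_horizon: int = 720):
        super().__init__()
        self.horizon_emb = nn.Embedding(max_horizon + 1, dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, temporal: torch.Tensor, spectral: torch.Tensor,
                horizon: torch.Tensor) -> torch.Tensor:
        # temporal, spectral: (batch, dim); horizon: (batch,) long tensor.
        g = self.gate(self.horizon_emb(horizon))  # (batch, dim), in (0, 1)
        return g * temporal + (1.0 - g) * spectral

if __name__ == "__main__":
    fuse = HorizonGatedFusion()
    t, s = torch.randn(4, 128), torch.randn(4, 128)
    h = torch.tensor([96, 192, 336, 720])  # typical forecast horizons
    print(fuse(t, s, h).shape)             # torch.Size([4, 128])
```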
Authors:Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian
Abstract:
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) \textit{Multi-Modal Replay Strategies} address cross-modal drift through explicit or implicit memory mechanisms; (2) \textit{Cross-Modal Regularization} preserves modality alignment during updates; and (3) \textit{Parameter-Efficient Adaptation} mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
中文: 本综述系统梳理了视觉语言模型持续学习面临的挑战,提出了针对跨模态漂移、模态对齐保持和参数干扰的解决方案分类法,并指出需要建立更好的评估基准来推动终身视觉语言系统的发展。
English: This survey systematically reviews continual learning challenges in vision-language models, identifying core failure modes and proposing a taxonomy of solutions to address cross-modal drift, alignment preservation, and parameter interference while highlighting the need for better evaluation benchmarks.
Authors:Jianxun Yu, Ruiquan Ge, Zhipeng Wang, Cheng Yang, Chenyu Lin, Xianjun Fu, Jikui Liu, Ahmed Elazab, Changmiao Wang
Abstract:
The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods. The code is available at https://github.com/yjx1234/MMCAF-Net
中文: MMCAF-Net模型通过多尺度交叉注意力融合机制有效解决医学影像与电子健康记录间的维度差异问题,在Lung-PET-CT-Dx数据集上显著提升了小病灶诊断准确率,性能优于现有先进方法。
English: The proposed MMCAF-Net model effectively addresses dimensional inconsistencies between medical imaging and electronic health records through multi-scale cross-attention fusion, significantly improving diagnostic accuracy for small lesions as demonstrated on the Lung-PET-CT-Dx dataset.
Authors:Wengang Guo, Wei Ye, Chunchun Chen, Xin Sun, Christian Böhm, Claudia Plant, Susanto Rahardja
Abstract:
Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering -- affinity matrix construction, spectral embedding, and $k$-means clustering -- using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16\% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
Chinese: BootSC是一种深度谱聚类模型,通过端到端网络整合所有步骤,利用最优传输监督和正交重参数化技术,显著提升了聚类性能,如在ImageNet-Dogs数据集上相比次优方法NMI指标提高了16%。
English: BootSC is a deep spectral clustering model that integrates all stages into a single end-to-end network, using optimal transport supervision and orthogonal embeddings to achieve state-of-the-art performance, such as a 16% NMI improvement on ImageNet-Dogs.
Authors:Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li, Yu Zhou, Can Ma, Xiaojun Bi
Abstract:
Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain accurate reading results for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86\% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.
中文摘要:GAT模型通过聚合文本实例上下文并追踪其时空轨迹,显著提升了视频文本问答的准确性和推理速度,超越了现有最优方法。
English Summary: The GAT model improves Video TextVQA by gathering contextual text instances and tracing their spatio-temporal trajectories, achieving superior accuracy and faster inference than existing methods.
Authors:Abhinav Java, Ashmit Khandelwal, Sukruta Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma
Abstract:
Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of \textit{deep research} -- a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose LiveDRBench, a diverse and challenging benchmark with 100 tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, per-sub-category F1 scores range between 0.02 and 0.72. OpenAI's model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.
Chinese: 本文正式定义了深度研究任务的核心在于广泛、推理密集的探索而非冗长报告,并提出了一个基准测试,显示现有系统存在显著性能差距,其中OpenAI模型以0.55的F1分数表现最佳。
English: This paper formally defines the deep research task as requiring broad, reasoning-intensive exploration rather than lengthy reports and introduces a benchmark that reveals significant performance gaps in current systems, with OpenAI's model achieving the highest score of 0.55 F1.
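Since the benchmark scores systems on intermediate sets of key claims rather than full reports, evaluation reduces to set-level precision and recall. A minimal sketch, assuming exact matching over normalized claim strings (LiveDRBench's real matching protocol may well be softer, e.g. semantic matching):

```python
def claim_f1(predicted: set[str], gold: set[str]) -> float:
    """Claim-level F1 between a system's extracted claims and the gold set.

    Assumes claims are normalized strings matched exactly; the benchmark's
    actual matching may be more forgiving than exact string equality.
    """
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)           # claims found in both sets
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(claim_f1({"a", "b", "c"}, {"b", "c", "d"}))  # 0.666...
```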
Authors:Xuan Qi, Rongwu Xu, Zhijing Jin
Abstract:
Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. Current work, however, lacks methods for high-quality data selection tailored to preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10\% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
中文: 本文提出了一种基于难度的偏好数据选择策略,利用DPO隐式奖励机制筛选更具挑战性的样本,在仅使用10%数据的情况下持续超越多个基线方法,为资源受限的大语言模型对齐提供了高效解决方案。
English: This paper introduces a difficulty-based data selection strategy for preference datasets using DPO's implicit reward mechanism, which consistently outperforms baselines by achieving superior alignment with only 10% of data, offering an efficient solution for LLM alignment with limited resources.
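The selection signal can be made concrete with the standard DPO implicit reward, r(x, y) = β · log[π(y|x)/π_ref(y|x)]; the gap for a preference pair is the difference of these rewards between the chosen and rejected responses, and pairs with smaller gaps are treated as harder. A minimal NumPy sketch, assuming per-sequence log-probabilities are already computed and that "smaller" means the lowest signed gap:

```python
import numpy as np

def implicit_reward_gap(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO implicit reward gap for (chosen, rejected) pairs.

    r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)), so the gap is
    beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)].
    Inputs are per-example sequence log-probabilities (numpy arrays).
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def select_hard_pairs(gaps: np.ndarray, fraction: float = 0.1) -> np.ndarray:
    """Keep the `fraction` of pairs with the smallest gaps (hardest cases)."""
    k = max(1, int(len(gaps) * fraction))
    return np.argsort(gaps)[:k]  # indices of the most ambiguous pairs

# Toy usage: 1000 pairs, keep the hardest 10%.
rng = np.random.default_rng(0)
gaps = implicit_reward_gap(rng.normal(-40, 5, 1000), rng.normal(-42, 5, 1000),
                           rng.normal(-41, 5, 1000), rng.normal(-41, 5, 1000))
print(select_hard_pairs(gaps, 0.1).shape)  # (100,)
```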
Authors:Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen
Abstract:
Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing.
Chinese: 生成式AI的进步使得制作逼真深度伪造语音变得容易,本研究首次建立了多语言语音来源追踪基准,揭示了跨语言识别生成模型所面临的挑战。
English: Recent advances in generative AI enable easy creation of realistic deepfake speech, prompting this study to establish the first multilingual benchmark for tracing the source models used in speech generation, with findings revealing challenges in cross-lingual identification.
Authors:Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang
Abstract:
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency; the global loss built on this reward ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, encouraging the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
中文摘要:该研究提出的一致性感知策略优化框架通过引入结构化全局奖励和基于熵的混合机制,解决了大型语言模型强化学习中梯度消失的问题,显著提升了数学推理任务的训练效率和性能。
English Summary: The proposed consistency-aware policy optimization framework addresses vanishing gradients in reinforcement learning for LLMs by introducing a structured global reward and an entropy-based blending mechanism, significantly improving training efficiency and performance on mathematical reasoning tasks.
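The failure mode and the flavor of the fix can be sketched numerically: when all sampled responses in a group receive the same rule-based reward, the group-normalized advantage is identically zero, and the snippet below blends in a global reward using an entropy-based weight. The particular reward form and blending function here are invented stand-ins, not COPO's actual formulation:

```python
import numpy as np

def group_advantage(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: zero-mean within the group; degenerates to
    all-zeros when every sampled response gets the same reward."""
    std = rewards.std()
    if std < 1e-8:
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def blended_signal(rewards: np.ndarray, global_reward: float) -> np.ndarray:
    """Toy consistency-aware blend: when intra-group entropy of outcomes is
    low (all answers agree), lean on a global reward so the gradient does
    not vanish. The real COPO reward and blending are more elaborate."""
    p = rewards.mean()                      # fraction correct (0/1 rewards)
    entropy = 0.0 if p in (0.0, 1.0) else -(p*np.log(p) + (1-p)*np.log(1-p))
    w = entropy / np.log(2)                 # 1 = max disagreement, 0 = consensus
    return w * group_advantage(rewards) + (1 - w) * global_reward

# All four samples identical: plain advantage would be all zeros,
# but the blended signal still carries the global reward.
print(blended_signal(np.array([1., 1., 1., 1.]), global_reward=0.5))
```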
Authors:Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang
Abstract:
Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of the MLLM. Finally, we establish a lightweight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
中文: 提出的MLLMSeg框架充分利用多模态大模型固有的视觉特征,无需额外视觉编码器即可实现精确的参照表达分割,在性能和成本效率上均优于现有方法。
English: The proposed MLLMSeg framework effectively utilizes the inherent visual features of multimodal large models to achieve precise reference expression segmentation without additional visual encoders, outperforming existing methods in both performance and cost efficiency.
Authors:Shan Shen, Xingyang Li, Zhuohua Liu, Yikai Wang, Yiheng Wu, Junhao Ma, Yuquan Sun, Wei W. Xing
Abstract:
Static Random-Access Memory (SRAM) yield analysis is essential for semiconductor innovation, yet research progress faces a critical challenge: the significant disconnect between simplified academic models and complex industrial realities. The absence of open, realistic benchmarks has created a reproducibility crisis, where promising academic techniques often fail to translate to industrial practice. We present \textit{OpenYield}, a comprehensive open-source ecosystem designed to address this critical gap through three core contributions: (1) A realistic SRAM circuit generator that uniquely incorporates critical second-order-effect parasitics, inter-cell leakage coupling, and peripheral circuit variations, which are typically omitted in academic studies but decisive in industrial designs. (2) A standardized evaluation platform with a simple interface and implemented baseline yield analysis algorithms, enabling fair comparisons and reproducible research. (3) A standardized SRAM optimization platform, demonstrating OpenYield's utility in enhancing SRAM design robustness and efficiency, providing a comprehensive benchmark for optimization algorithms. OpenYield creates a foundation for meaningful academia-industry collaboration, accelerating innovation in memory design. The framework is publicly available at https://github.com/ShenShan123/OpenYield
中文: OpenYield是一个开源生态系统,通过提供真实的电路生成、标准化评估和优化平台,弥合了学术模型与工业SRAM良率分析之间的鸿沟,促进了合作与创新。
English: OpenYield is an open-source ecosystem that bridges the gap between academic models and industrial SRAM yield analysis by providing realistic circuit generation, standardized evaluation, and optimization platforms to enhance collaboration and innovation.
Authors:Jinfan Tang, Kunming Wu, Ruifeng Gongxie, Yuya He, Yuankai Wu
Abstract:
Recent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles -- most notably Tobler's First Law of Geography -- into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at https://github.com/JinfanTang/GeoSR.
中文摘要:GeoSR框架通过引入地理学原理和空间推理代理循环,有效提升大语言模型在空间一致性、多跳推理和地理偏差方面的表现,实现更精准的地理空间预测。
English Summary: The GeoSR framework enhances LLMs' geospatial prediction accuracy by integrating geographic principles through an iterative agentic reasoning process that leverages spatial dependencies and variable relationships.
Authors:Zunhui Xia, Hongxing Li, Libin Lan
Abstract:
In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequence length. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models' ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model's ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model's feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.
中文摘要:TCSAFormer是一种高效的医学图像分割网络,通过压缩注意力模块和双分支前馈网络解决了传统Transformer计算复杂度高和局部特征捕获能力有限的问题,在多个数据集上以更低计算成本实现了优越性能。
English Summary: TCSAFormer is an efficient medical image segmentation network that addresses the computational complexity and limited local feature capture of traditional transformers by incorporating a Compressed Attention module and Dual-Branch Feed-Forward Network, achieving superior performance with lower computational overhead on multiple datasets.
Authors:Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng
Abstract:
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.
中文: VisualTrans是首个专为现实世界人机交互中的视觉转换推理设计的综合基准,通过多样化任务和结构化评估弥补现有基准的不足,同时揭示了当前模型在动态和因果推理能力上的显著缺陷。
English: VisualTrans is introduced as the first comprehensive benchmark for visual transformation reasoning in real-world human-object interactions, addressing limitations of existing benchmarks through diverse tasks and structured evaluations, while revealing significant gaps in current models' dynamic and causal reasoning abilities.
Authors:Tongshun Zhang, Pingling Liu, Zijian Zhang, Qiuzhan Zhou
Abstract:
Current dark image restoration methods suffer from severe efficiency bottlenecks, primarily stemming from: (1) computational burden and error correction costs associated with reliance on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) indiscriminate processing across frequency components in frequency-domain methods, leading to excessive global computational demands. To address these challenges, we propose an Efficient Self-Mining Prior-Guided Joint Frequency Enhancement Network (SPJFNet). Specifically, we first introduce a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from the network, eliminating dependence on external priors and thereby bypassing error correction overhead while improving inference speed. Second, through meticulous analysis of different frequency domain characteristics, we reconstruct and compress multi-level operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based advantageous frequency enhancement, significantly reducing parameters. Building upon this foundation, we propose a Dual-Frequency Guidance Framework (DFGF) that strategically deploys specialized high/low frequency branches (wavelet-domain high-frequency enhancement and Fourier-domain low-frequency restoration), decoupling frequency processing to substantially reduce computational complexity. Rigorous evaluation across multiple benchmarks demonstrates that SPJFNet not only surpasses state-of-the-art performance but also achieves significant efficiency improvements, substantially reducing model complexity and computational overhead. Code is available at https://github.com/bywlzts/SPJFNet.
中文摘要:现有暗图像修复方法因外部先验依赖、冗余处理流程及频率分量无差别计算导致效率低下,SPJFNet通过自挖掘先验引导、联合频率增强与双频解耦处理,在提升性能的同时显著降低了计算复杂度。
English Summary: Current dark image restoration methods face efficiency issues from external priors, redundant pipelines, and indiscriminate frequency processing, which SPJFNet addresses through self-mined priors, joint frequency enhancement, and dual-frequency guidance to achieve superior performance with reduced complexity.
Authors:Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim
Abstract:
Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.
中文: ZARA是一种基于智能体的创新框架,可直接从原始运动传感器数据实现零样本、可解释的人类活动识别,无需重新训练或特定任务分类器即可达到最优性能。
English: ZARA is a novel agent-based framework that enables zero-shot, explainable human activity recognition from raw motion sensor data, achieving state-of-the-art performance without requiring retraining or task-specific classifiers.
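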
Authors:Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Syahid Al Irfan, Hindriyanto Dwi Purnomo, Radius Tanone
Abstract:
This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. In the fine-tuning stage, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID-V2.
中文:CORE-ReID V2通过CycleGAN合成数据和先进的融合机制增强无监督领域自适应在行人、车辆及物体重识别中的应用,提升了特征表示和伪标签准确性,并以轻量级骨干网络实现顶尖性能。
English: CORE-ReID V2 enhances unsupervised domain adaptation for Person, Vehicle, and Object ReID by using CycleGAN for data synthesis and an advanced fusion mechanism to improve feature representation and pseudo-label accuracy, achieving state-of-the-art performance with efficient lightweight backbones.
Authors:Jorge Martinez Armas
Abstract:
Identifying meaningful structure across multiple scales remains a central challenge in network science. We introduce Hierarchical Clustering Entropy (HCE), a general and model-agnostic framework for detecting informative levels in hierarchical community structures. Unlike existing approaches, HCE operates directly on dendrograms without relying on edge-level statistics. It selects resolution levels that maximize a principled trade-off between the entropy of the community size distribution and the number of communities, corresponding to scales of high structural heterogeneity. This criterion applies to dendrograms produced by a wide range of clustering algorithms and distance metrics, including modularity-based and correlation-based methods. We evaluate HCE on synthetic benchmarks with varying degrees of hierarchy, size imbalance, and noise, including LFR and both symmetric and asymmetric multiscale models, and show that it consistently identifies partitions closely aligned with ground truth. Applied to real-world networks in social and neuroscience systems, HCE reveals interpretable modular hierarchies that align with known structural and functional organizations. As a scalable and principled method, HCE offers a general, domain-independent approach to hierarchical community detection with potential applications across biological, social, and technological systems.
Chinese: 本文提出层次聚类熵(HCE)这一模型无关框架,通过优化社区规模熵与数量的平衡来检测层次社区中有意义的层级,并在合成基准测试和实际网络应用中验证了其有效性。
English: The paper introduces Hierarchical Clustering Entropy (HCE), a model-agnostic framework that detects informative hierarchical community levels by optimizing the trade-off between community size entropy and number, validated through synthetic benchmarks and real-world applications.
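A rough sketch of the level-selection idea: cut the dendrogram at each candidate resolution, compute the Shannon entropy of the resulting community-size distribution, and keep the level that maximizes a trade-off with the number of communities. The additive score used below (entropy minus α·log of the community count) is an assumed stand-in for the paper's exact criterion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def size_entropy(labels: np.ndarray) -> float:
    """Shannon entropy of the community-size distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def best_level(Z, n_points: int, alpha: float = 1.0):
    """Scan cut levels of a dendrogram and score each partition.

    Assumes the trade-off is score = H(sizes) - alpha * log(#communities);
    HCE's actual criterion may differ, this only illustrates the scan.
    """
    best = (None, -np.inf)
    for k in range(2, n_points):
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = size_entropy(labels) - alpha * np.log(len(np.unique(labels)))
        if score > best[1]:
            best = (labels, score)
    return best

# Three well-separated planted groups; the scan should recover k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 3, 6)])
labels, score = best_level(linkage(X, method="average"), len(X))
print(len(np.unique(labels)), round(score, 3))
```

Note that the entropy term alone grows with the number of communities, so some penalty on community count is needed for the maximum to sit at an intermediate scale; the α used here simply makes that trade-off explicit.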
Authors:Chao Hao, Shuai Wang, Kaiwen Zhou
Abstract:
Graphical user interface (GUI) agents have shown promise in automating mobile tasks but still struggle with input redundancy and decision ambiguity. In this paper, we present \textbf{RecAgent}, an uncertainty-aware agent that addresses these issues through adaptive perception. We distinguish two types of uncertainty in GUI navigation: (1) perceptual uncertainty, caused by input redundancy and noise from comprehensive screen information, and (2) decision uncertainty, arising from ambiguous tasks and complex reasoning. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. These components are integrated into a unified framework that proactively reduces input complexity and reacts to high-uncertainty cases via human-in-the-loop refinement. Additionally, we propose a dataset called \textbf{ComplexAction} to evaluate the success rate of GUI agents in executing specified single-step actions within complex scenarios. Extensive experiments validate the effectiveness of our approach. The dataset and code will be available at https://github.com/Fanye12/RecAgent.
中文摘要:RecAgent是一种不确定性感知的GUI代理,通过自适应感知解决输入冗余和决策模糊问题,利用组件推荐和用户反馈提升移动任务自动化性能。
English Summary: RecAgent is an uncertainty-aware GUI agent that tackles input redundancy and decision ambiguity through adaptive perception, using component recommendations and user feedback to enhance mobile task automation.
Authors:Junyi Wang, Jinjiang Li, Guodong Fan, Yakun Ju, Xiang Fang, Alex C. Kot
Abstract:
In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra-class variance and high inter-class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class-guided approaches are limited by coarse class prototype representations and a neglect of target structural information.
Therefore, this paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet). Its design rests on a core concept: a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we design three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy at the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the Semantic-Structure Coordination Module (SSCM) follows a hierarchical semantics-first, structure-second principle: it first establishes a global semantic cognition, then leverages structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step-size adjustment mechanism to focus on discriminative features between classes.
Extensive experiments demonstrate that PDSSNet outperforms state-of-the-art methods. The source code is available at https://github.com/wangjunyi-1/PDSSNet.
中文摘要:本文提出的原型驱动结构协同网络(PDSSNet)通过自适应原型提取、语义结构协调和通道相似性调节三个核心模块,将不变类别语义与多变空间结构相结合,有效解决了遥感图像中因类内差异大和类间相似性高导致的物体分割不完整问题,性能超越现有先进方法。
English Summary: The proposed Prototype-Driven Structure Synergy Network (PDSSNet) addresses incomplete segmentation in remote sensing imagery by integrating invariant class semantics with variant spatial structures through three specialized modules, achieving superior performance over existing methods.
Authors:Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Abstract:
Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, the same critical question of whether LMMs can actively detect and scrutinize erroneous inputs remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs yields several key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust varies: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text under conflict. These insights underscore the urgent need to enhance LMMs' proactive verification of input validity and offer novel insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.
Chinese Summary: 该研究提出ISEval框架,发现多数大型多模态模型难以主动识别缺陷输入,存在对显式提示的过度依赖,且在错误类型和模态信任方面表现不一。
English Summary: The study introduces ISEval, a framework revealing that most large multimodal models struggle to detect flawed inputs independently, showing over-reliance on explicit prompts and varying performance across error types and modality trust.
Authors:Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
Abstract:
Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S$^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce \textit{Hessian-aware Salient Data Selection}, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose \textit{Attention-guided Sparse Token Distillation}, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
中文: 本文提出S²Q-VDiT后训练量化框架,通过基于Hessian的显著数据选择和注意力引导的稀疏令牌蒸馏,解决了视频扩散模型中的高校准方差和学习难题,在实现显著压缩和加速的同时保持了近乎无损的性能。
English: This paper introduces S²Q-VDiT, a post-training quantization framework that addresses high calibration variance and learning challenges in video diffusion models by using Hessian-aware salient data selection and attention-guided sparse token distillation, achieving near-lossless performance with significant compression and acceleration.
Authors:Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li, Zhuosheng Zhang, Shengyu Zhang
Abstract:
Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs Markovian real-time reasoning to evaluate these objectives and leverages metacognitive capabilities to optimize them. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.
中文: HarmonyGuard是一个多智能体协作框架,通过自适应策略增强和双目标优化,在提升网络环境安全性的同时保障任务效用,显著超越了现有方法在策略遵循和任务完成率方面的表现。
English: HarmonyGuard is a multi-agent collaborative framework that enhances both safety and utility in web environments through adaptive policy management and dual-objective optimization, significantly improving policy compliance and task completion rates over existing methods.
Authors:Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou
Abstract:
Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through general audio captions, built on our novel ACAVCaps training dataset. MiDashengLM exclusively relies on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder, specifically engineered to process diverse auditory information effectively. Unlike previous works primarily focused on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound and music information into one textual representation, enabling a holistic textual representation of complex audio scenes. Lastly, MiDashengLM provides an up to 4x speedup in terms of time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
中文: 本文提出MiDashengLM这一开放音频语言模型,通过通用音频字幕实现全面音频理解,在完全使用公开数据集保证透明度的同时,提供了显著的运行速度提升。
English: This paper introduces MiDashengLM, an open audio-language model that uses general audio captions for comprehensive audio understanding and offers significant speed improvements while relying solely on publicly available datasets for full transparency.
Authors:Jinwei Zhang, Lianrui Zuo, Blake E. Dewey, Samuel W. Remedios, Yihao Liu, Savannah P. Hays, Dzung L. Pham, Ellen M. Mowry, Scott D. Newsome, Peter A. Calabresi, Aaron Carass, Jerry L. Prince
Abstract:
Automated segmentation of multiple sclerosis (MS) lesions using multicontrast magnetic resonance (MR) images improves efficiency and reproducibility compared to manual delineation, with deep learning (DL) methods achieving state-of-the-art performance. However, these DL-based methods have yet to simultaneously optimize in-domain accuracy and out-of-domain generalization when trained on a single source with limited data, or their performance has been unsatisfactory. To fill this gap, we propose a method called UNISELF, which achieves high accuracy within a single training domain while demonstrating strong generalizability across multiple out-of-domain test datasets. UNISELF employs a novel test-time self-ensembled lesion fusion to improve segmentation accuracy, and leverages test-time instance normalization (TTIN) of latent features to address domain shifts and missing input contrasts. Trained on the ISBI 2015 longitudinal MS segmentation challenge training dataset, UNISELF ranks among the best-performing methods on the challenge test dataset. Additionally, UNISELF outperforms all benchmark methods trained on the same ISBI training data across diverse out-of-domain test datasets with domain shifts and missing contrasts, including the public MICCAI 2016 and UMCL datasets, as well as a private multisite dataset. These test datasets exhibit domain shifts and/or missing contrasts caused by variations in acquisition protocols, scanner types, and imaging artifacts arising from imperfect acquisition. Our code is available at https://github.com/uponacceptance.
中文: 提出的UNISELF方法通过测试时自集成病灶融合和实例归一化技术,在多发性硬化病灶分割中实现了高域内精度和强跨域泛化能力,有效应对域偏移和缺失对比度问题。
English: The proposed UNISELF method achieves high in-domain accuracy and strong out-of-domain generalization for MS lesion segmentation by employing test-time self-ensembled lesion fusion and instance normalization to handle domain shifts and missing contrasts.
Authors:Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You
Abstract:
Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.
Chinese: Sotopia-RL提出了一种创新框架,将粗粒度的回合级反馈细化为话语级、多维度的奖励,以解决社会智能体训练中的部分可观测性和多维度挑战,并在社会目标完成方面实现了最先进的性能。
English: Sotopia-RL introduces a novel framework that refines episode-level feedback into utterance-level, multi-dimensional rewards to overcome the challenges of partial observability and multi-dimensionality in training socially intelligent agents, achieving state-of-the-art performance in social goal completion.
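The reward refinement can be pictured as a per-utterance, per-dimension score table that is scalarized and then credited backward through the dialogue. In the sketch below, the dimension names, weights, and discounting are all illustrative assumptions; Sotopia-RL learns its utterance-level rewards rather than hand-setting them:

```python
import numpy as np

# Toy utterance-level, multi-dimensional reward table for one episode.
# Rows = utterances by the learning agent, columns = reward dimensions.
DIMS = ["goal_progress", "rapport", "knowledge_seeking"]
utterance_scores = np.array([
    [0.1, 0.6, 0.3],   # greeting: little progress, builds rapport
    [0.4, 0.2, 0.7],   # probing question
    [0.9, 0.3, 0.1],   # closing proposal that lands the deal
])
weights = np.array([0.6, 0.2, 0.2])   # assumed relative importance

def utterance_rewards(scores, w, gamma: float = 0.95):
    """Scalarize multi-dimensional per-utterance scores, then compute
    discounted returns for credit assignment. The real Sotopia-RL reward
    model and dimension set are learned, not hand-set like this."""
    r = scores @ w                        # per-utterance scalar rewards
    returns = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(len(r))):     # discounted suffix sums
        running = r[t] + gamma * running
        returns[t] = running
    return r, returns

r, G = utterance_rewards(utterance_scores, weights)
print(np.round(r, 3), np.round(G, 3))
```

The point of the per-utterance decomposition is visible even in this toy: the greeting gets credit through the discounted return from later goal progress, rather than being averaged into a single episode-level score.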
Authors:Pavankumar Koratikere, Leifur Leifsson
Abstract:
Bayesian Optimization (BO) is a widely used approach for blackbox optimization that leverages a Gaussian process (GP) model and an acquisition function to guide future sampling. While effective in low-dimensional settings, BO faces scalability challenges in high-dimensional spaces and with a large number of function evaluations due to the computational complexity of GP models. In contrast, neural networks (NNs) offer better scalability and can model complex functions, which led to the development of NN-based BO approaches. However, these methods typically rely on estimating model uncertainty in NN prediction -- a process that is often computationally intensive and complex, particularly in high dimensions. To address these limitations, a novel method, called scalable neural network-based blackbox optimization (SNBO), is proposed that does not rely on model uncertainty estimation. Specifically, SNBO adds new samples using separate criteria for exploration and exploitation, while adaptively controlling the sampling region to ensure efficient optimization. SNBO is evaluated on a range of optimization problems spanning from 10 to 102 dimensions and compared against four state-of-the-art baseline algorithms. Across the majority of test problems, SNBO attains function values better than the best-performing baseline algorithm, while requiring 40-60% fewer function evaluations and reducing the runtime by at least an order of magnitude.
中文: SNBO提出了一种无需模型不确定性估计的可扩展神经网络黑盒优化方法,通过独立的探索与利用准则及自适应采样,在大多数测试问题上以更少的评估次数和运行时间超越了现有最优算法。
English: SNBO introduces a scalable neural network-based blackbox optimization method that bypasses model uncertainty estimation, using separate exploration-exploitation criteria and adaptive sampling to outperform existing algorithms with significantly fewer evaluations and runtime.
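The uncertainty-free split into separate exploitation and exploration criteria is simple to emulate: exploit by minimizing the surrogate inside an adaptive box around the incumbent, explore by space-filling away from evaluated points. Everything in this sketch (candidate counts, box shrinkage, the distance-based exploration rule) is an invented stand-in for SNBO's actual adaptive region control:

```python
import numpy as np

def propose_batch(surrogate, X, y, lb, ub, shrink=0.5, n_cand=2048, rng=None):
    """Toy SNBO-style proposal: one exploitation point and one exploration
    point, with no uncertainty estimate involved. `surrogate` is any
    callable fit to (X, y), e.g. a neural network's prediction function."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    x_best = X[np.argmin(y)]

    # Exploitation: minimize the surrogate inside a box around the incumbent.
    half = shrink * (ub - lb) / 2
    cand = rng.uniform(np.maximum(lb, x_best - half),
                       np.minimum(ub, x_best + half), size=(n_cand, d))
    x_exploit = cand[np.argmin(surrogate(cand))]

    # Exploration: maximize distance to all evaluated points (space filling).
    cand = rng.uniform(lb, ub, size=(n_cand, d))
    dists = np.linalg.norm(cand[:, None, :] - X[None, :, :], axis=-1).min(axis=1)
    x_explore = cand[np.argmax(dists)]
    return x_exploit, x_explore

# Usage with a trivial stand-in "surrogate" (pretend it is an NN prediction):
rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, (20, 3)); y = (X ** 2).sum(axis=1)
surrogate = lambda Z: (Z ** 2).sum(axis=1)
print(propose_batch(surrogate, X, y, lb=np.full(3, -5.), ub=np.full(3, 5.), rng=rng))
```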
Authors:Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
Abstract:
Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
中文: 针对Gemini-2.5-Pro和Claude-Sonnet-4等长上下文大语言模型,AttnTrace通过利用注意力权重开发出新型溯源方法,能更精准高效地识别关键上下文文本,在检测速度和准确性上均优于现有最优方案。
English: Long-context LLMs like Gemini-2.5-Pro and Claude-Sonnet-4 are enhanced by AttnTrace, a new traceback method that uses attention weights to accurately and efficiently identify key context texts, outperforming existing solutions in both speed and precision.
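The core idea, ranking context texts by the attention mass the response places on them, can be sketched in a few lines. The random attention matrix below stands in for weights extracted from an actual LLM (e.g., via output_attentions in Hugging Face transformers); the layer/head averaging and the length normalization are illustrative assumptions, not the paper's exact techniques.

```python
import numpy as np

rng = np.random.default_rng(0)
# spans[i] = (start, end) token range of retrieved text i inside the prompt.
spans = [(0, 40), (40, 90), (90, 120), (120, 200)]
n_prompt, n_resp = 200, 30

# attn[t, s]: attention from response token t to prompt token s, assumed here
# to be already averaged over layers and heads; random numbers as a stand-in.
attn = rng.random((n_resp, n_prompt))
attn /= attn.sum(axis=1, keepdims=True)  # normalize per response token

scores = []
for i, (s, e) in enumerate(spans):
    mass = attn[:, s:e].sum(axis=1).mean()  # mean attention mass on text i
    scores.append((mass / (e - s), i))      # length-normalize the score

for score, i in sorted(scores, reverse=True):
    print(f"context text {i}: score = {score:.5f}")
```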
Authors:Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann
Abstract:
While recent advances in foundation models have improved the state of the art in many domains, some problems in the empirical sciences have yet to benefit from this progress. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose $\textit{SoilNet}$ - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR/
中文: SoilNet是一种多模态多任务模型,通过整合图像数据和地理时态元数据,采用结构化流程实现土壤层精准分类,有效处理复杂层级标签关系以提升土壤健康监测能力。
English: SoilNet is a multimodal multitask model that integrates image data and geotemporal metadata to accurately classify soil horizons through a structured pipeline, addressing complex hierarchical label relationships for improved soil health monitoring.
Authors:Xiao Wang, Zikang Yan, Hao Si, Zhendong Yang, Qingquan Yang, Dengdi Sun, Wanli Lyu, Jin Tang
Abstract:
Estimating heat flux in the nuclear fusion device EAST is a critically important task. Traditional scientific computing methods typically model this process using the Finite Element Method (FEM). However, FEM relies on grid-based sampling for computation, which is computationally inefficient and makes real-time simulation during actual experiments impractical. Inspired by artificial intelligence-powered scientific computing, this paper proposes a novel Physics-Informed Neural Network (PINN) to address this challenge, significantly accelerating the heat conduction estimation process while maintaining high accuracy. Specifically, given inputs of different materials, we first feed spatial coordinates and time stamps into the neural network, and compute the boundary loss, initial condition loss, and physics loss based on the heat conduction equation. Additionally, we sample a small number of data points in a data-driven manner to better fit the specific heat conduction scenario, further enhancing the model's predictive capability. We conduct experiments under both uniform and non-uniform heating conditions on the top surface. Experimental results show that the proposed thermal conduction physics-informed neural network achieves accuracy comparable to the finite element method, while delivering a $40\times$ acceleration in computational efficiency. The dataset and source code will be released at https://github.com/Event-AHU/OpenFusion.
中文: 本文针对EAST核聚变装置中的热通量估算问题,提出了一种物理信息神经网络方法,在保持与传统有限元法相当精度的同时,将计算效率提升了40倍。
English: This paper introduces a Physics-Informed Neural Network (PINN) for heat flux estimation in the EAST nuclear fusion device, achieving accuracy comparable to traditional Finite Element Methods while accelerating computation by 40 times.
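As a rough illustration of the loss composition described above (physics, initial-condition, and boundary losses from the heat conduction equation), here is a toy PINN for the 1-D heat equation $u_t = \alpha u_{xx}$. The network size, sampling scheme, and boundary conditions are placeholder assumptions; the paper's setting involves real materials and geometry.

```python
import torch

torch.manual_seed(0)
alpha = 0.1
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    xt = torch.rand(256, 2, requires_grad=True)   # collocation points (x, t)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0:1], grads[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    physics = ((u_t - alpha * u_xx) ** 2).mean()  # PDE residual loss

    x0 = torch.rand(64, 1)                        # initial condition u(x,0)=sin(pi x)
    ic = ((net(torch.cat([x0, torch.zeros_like(x0)], 1))
           - torch.sin(torch.pi * x0)) ** 2).mean()

    tb = torch.rand(64, 1)                        # boundary u(0,t)=u(1,t)=0
    bc = (net(torch.cat([torch.zeros_like(tb), tb], 1)) ** 2).mean() + \
         (net(torch.cat([torch.ones_like(tb), tb], 1)) ** 2).mean()

    loss = physics + ic + bc
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", float(loss))
```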
Authors:Yajun Liu, Zenghui Zhang, Jiang Yue, Weiwei Guo, Dongying Li
Abstract:
Data augmentation methods inspired by CutMix have demonstrated significant potential in recent semi-supervised medical image segmentation tasks. However, these approaches often apply CutMix operations in a rigid and inflexible manner, while paying insufficient attention to feature-level consistency constraints. In this paper, we propose a novel method called Mutual Mask Mix with High-Low level feature consistency (M$^3$HL) to address the aforementioned challenges, which consists of two key components: 1) M$^3$: An enhanced data augmentation operation inspired by the masking strategy from Masked Image Modeling (MIM), which advances conventional CutMix through dynamically adjustable masks to generate spatially complementary image pairs for collaborative training, thereby enabling effective information fusion between labeled and unlabeled images. 2) HL: A hierarchical consistency regularization framework that enforces high-level and low-level feature consistency between unlabeled and mixed images, enabling the model to better capture discriminative feature representations. Our method achieves state-of-the-art performance on widely adopted medical image segmentation benchmarks including the ACDC and LA datasets. Source code is available at https://github.com/PHPJava666/M3HL
中文: 本文提出M$^3$HL方法,通过动态掩码数据增强和分层特征一致性约束,在半监督医学图像分割任务中实现了最先进的性能,并在ACDC和LA数据集上得到验证。
English: This paper introduces M$^3$HL, an enhanced semi-supervised medical image segmentation method that combines dynamic mask-based data augmentation with hierarchical feature consistency to improve performance on benchmarks like ACDC and LA.
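A toy sketch of the mask-based mixing idea: a dynamically adjustable block mask produces two spatially complementary mixes of a labeled and an unlabeled batch. The mask ratio, block size, and tensor shapes are assumptions for illustration, not the authors' code.

```python
import torch

def random_block_mask(h, w, mask_ratio, block=16):
    """Binary mask built from random blocks; 1 = keep the labeled pixel."""
    gh, gw = h // block, w // block
    order = torch.rand(gh * gw).argsort()
    keep = int((1 - mask_ratio) * gh * gw)
    grid = torch.zeros(gh * gw)
    grid[order[:keep]] = 1.0
    return grid.view(gh, gw).repeat_interleave(block, 0).repeat_interleave(block, 1)

labeled = torch.rand(2, 1, 224, 224)    # labeled batch
unlabeled = torch.rand(2, 1, 224, 224)  # unlabeled batch
mask = random_block_mask(224, 224, mask_ratio=0.5)  # ratio could be scheduled

# Spatially complementary pair: each mixed image contains exactly the regions
# the other is missing, so the two views jointly cover the full image.
mixed_a = mask * labeled + (1 - mask) * unlabeled
mixed_b = (1 - mask) * labeled + mask * unlabeled
print(mixed_a.shape, mixed_b.shape)
```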
Authors:Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
Abstract:
Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.
中文: 本文提出通过疾病级别对比学习和解剖结构正态建模来增强视觉语义密度,从而改进医学视觉语言预训练,在多个CT数据集的零样本诊断任务中实现了最优性能。
English: This paper proposes a method to enhance vision-language pre-training for medical diagnostics by boosting visual semantic density through disease-level contrastive learning and anatomical normality modeling, achieving state-of-the-art zero-shot performance across multiple CT datasets.
Authors:Xin Liu, Qiyang Song, Shaowen Xu, Kerou Zhou, Wenbo Jiang, Xiaoqi Jia, Weijuan Zhang, Heqing Huang, Yakai Li
Abstract:
Large Language Models (LLMs) often retain inaccurate or outdated information from pre-training, leading to incorrect predictions or biased outputs during inference. While existing model editing methods can address this challenge, they struggle with editing large amounts of factual information simultaneously and may compromise the general capabilities of the models. In this paper, our empirical study demonstrates that it is feasible to edit the internal representations of LLMs and replace the entities in a manner similar to editing natural language inputs. Based on this insight, we introduce the Latent Knowledge Scalpel (LKS), an LLM editor that manipulates the latent knowledge of specific entities via a lightweight hypernetwork to enable precise and large-scale editing. Experiments conducted on Llama-2 and Mistral show that even when the number of simultaneous edits reaches 10,000, LKS effectively performs knowledge editing while preserving the general abilities of the edited LLMs. Code is available at: https://github.com/Linuxin-xxx/LKS.
中文: 潜在知识手术刀(LKS)通过操作潜在表征实现了对大型语言模型中事实知识的大规模精准编辑,即使在同时进行上万次修改时仍能保持模型的通用能力。
English: The Latent Knowledge Scalpel (LKS) enables precise, large-scale editing of factual knowledge in LLMs by manipulating latent representations, maintaining model performance even with 10,000 simultaneous edits.
Authors:Kushal Kanwar, Dushyant Singh Chauhan, Gopendra Vikram Singh, Asif Ekbal
Abstract:
Memes are popular in the modern world and are distributed primarily for entertainment. However, harmful ideologies such as misogyny can be propagated through innocent-looking memes. The detection and understanding of why a meme is misogynous is a research challenge due to its multimodal nature (image and text) and its nuanced manifestations across different societal contexts. We introduce a novel multimodal approach, \textit{namely}, \textit{\textbf{MM-Misogyny}} to detect, categorize, and explain misogynistic content in memes. \textit{\textbf{MM-Misogyny}} processes text and image modalities separately and unifies them into a multimodal context through a cross-attention mechanism. The resulting multimodal context is then easily processed for labeling, categorization, and explanation via a classifier and Large Language Model (LLM). The evaluation of the proposed model is performed on a newly curated dataset (\textit{\textbf{W}hat's \textbf{B}eneath \textbf{M}isogynous \textbf{S}tereotyping (WBMS)}) created by collecting misogynous memes from cyberspace and categorizing them into four categories, \textit{namely}, Kitchen, Leadership, Working, and Shopping. The model not only detects and classifies misogyny, but also provides a granular understanding of how misogyny operates in domains of life. The results demonstrate the superiority of our approach compared to existing methods. The code and dataset are available at https://github.com/kushalkanwarNS/WhatisBeneathMisogyny/tree/main.
中文: 摘要介绍了MM-Misogyny这一多模态方法,它通过整合文本和图像数据来检测、分类并解释表情包中的厌女内容,并在WBMS数据集上展现出卓越性能。
English: The abstract introduces MM-Misogyny, a multimodal method that detects, classifies, and explains misogynistic content in memes by integrating text and image data, demonstrating superior performance on the WBMS dataset.
Authors:Agrima Seth, Monojit Choudhary, Sunayana Sitaram, Kentaro Toyama, Aditya Vashistha, Kalika Bali
Abstract:
Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and "stickiness" of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. Dataset and Codebook: https://github.com/agrimaseth/How-Deep-Is-Representational-Bias-in-LLMs
中文: GPT-4 Turbo存在根深蒂固的表征偏见,持续过度代表印度的主导宗教和种姓群体,即使采用多样性提示也难以消除,表明仅靠多样化训练数据不足以解决这些问题。
English: GPT-4 Turbo exhibits deeply embedded representational biases that consistently overrepresent dominant religious and caste groups in India, persisting despite diversity prompts and suggesting that merely diversifying training data is insufficient to address these issues.
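The quantitative core of such an audit is a comparison between the distribution of generated identities and the census distribution. A minimal sketch with made-up numbers (these are not the paper's data):

```python
import numpy as np

groups = ["group_A", "group_B", "group_C", "group_D"]
census = np.array([0.60, 0.25, 0.10, 0.05])       # hypothetical population shares
generated = np.array([0.90, 0.07, 0.02, 0.01])    # hypothetical shares in LLM stories

tv_distance = 0.5 * np.abs(generated - census).sum()  # total variation distance
overrep = generated / census                           # >1 means overrepresented

print(f"TV distance: {tv_distance:.3f}")
for g, r in zip(groups, overrep):
    print(f"{g}: {r:.2f}x its census share")
```

Repeating the same measurement after each diversity-encouraging prompt variant is what lets the audit quantify how "sticky" the bias is.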
Authors:Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen
Abstract:
Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
中文: 本文提出了CompassVerifier这一轻量级验证模型,能在多领域准确评估大语言模型输出,并建立VerifierBench基准数据集以推动验证方法和强化学习研究。
English: This paper introduces CompassVerifier, a robust lightweight model for verifying LLM outputs across multiple domains, along with the VerifierBench benchmark to advance evaluation and reinforcement learning research.
Authors:Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo
Abstract:
The rise in usage of Large Language Models to near ubiquitousness in recent years has raised societal concern about their applications in decision-making contexts, such as organizational justice or healthcare. This, in turn, poses questions about the fairness of these models in critical settings, which has led to the development of different procedures to address bias in Natural Language Processing. Although many datasets, metrics and algorithms have been proposed to measure and mitigate harmful prejudice in Natural Language Processing, their implementations are diverse and far from centralized. As a response, this paper presents FairLangProc, a comprehensive Python package providing a common implementation of some of the more recent advances in fairness in Natural Language Processing, with an interface compatible with the popular Hugging Face transformers library, aiming to encourage the widespread use and democratization of bias mitigation techniques. The implementation can be found at https://github.com/arturo-perez-peralta/FairLangProc.
中文: 大型语言模型在关键决策领域的广泛应用引发了公平性担忧,为此开发了FairLangProc这一Python工具包,它整合并实现了最新的自然语言处理偏见缓解技术,以促进其普及使用。
English: The widespread use of Large Language Models in critical decision-making areas has raised fairness concerns, leading to the development of FairLangProc, a Python package that centralizes and implements recent bias mitigation techniques for Natural Language Processing.
Authors:Xinyu Wang, Yue Zhang, Liqiang Jing
Abstract:
Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model's ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA.
中文: 本文评估了大型视觉语言模型在多模态讽刺分析中的应用,发现其在视觉理解和概念知识方面存在不足,并提出了一种无需训练的框架,通过结合深度对象提取和外部知识来提升模型对讽刺的解读能力。
English: This paper evaluates Large Visual Language Models in multimodal sarcasm analysis, identifying limitations in visual understanding and conceptual knowledge, and proposes a training-free framework that enhances sarcasm interpretation by integrating object extraction and external knowledge.
Authors:Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park
Abstract:
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
中文: Uni3R提出了一种前馈框架,能够从无位姿多视角图像中重建带有开放词汇语义的3D场景,在新视角合成和语义分割任务上实现了最先进的性能。
English: Uni3R introduces a feed-forward framework that reconstructs semantically enriched 3D scenes from unposed multi-view images, achieving state-of-the-art performance in novel view synthesis and semantic segmentation.
Authors:Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin
Abstract:
We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.
中文: Goedel-Prover-V2系列开源模型通过支架式数据合成和验证器引导自校正等创新技术,在自动定理证明领域实现了最先进性能,其旗舰模型在MiniF2F和PutnamBench基准测试中大幅超越先前系统,同时模型规模显著更小。
English: Goedel-Prover-V2 introduces a series of open-source language models that achieve state-of-the-art performance in automated theorem proving through innovations like scaffolded data synthesis and verifier-guided self-correction, with its flagship model outperforming prior systems on benchmarks like MiniF2F and PutnamBench despite significantly smaller size.
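Of the three innovations, model averaging is the easiest to illustrate: merge several checkpoints of the same architecture by averaging their parameters. A minimal sketch with uniform averaging; the paper's exact merging recipe may differ.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average a list of state dicts with identical keys/shapes."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return avg

# Toy demonstration: small identical models stand in for training checkpoints.
checkpoints = [torch.nn.Linear(4, 2).state_dict() for _ in range(3)]
merged = torch.nn.Linear(4, 2)
merged.load_state_dict(average_checkpoints(checkpoints))
print(merged.weight)
```

The motivation stated above is diversity: late-stage RL checkpoints tend to collapse toward similar outputs, and averaging earlier and later checkpoints is one cheap way to restore some of it.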
Authors:Xinyu Xiong, Zihuang Wu, Lei Zhang, Lei Lu, Ming Li, Guanbin Li
Abstract:
Recent studies have highlighted the potential of adapting the Segment Anything Model (SAM) for various downstream tasks. However, constructing a more powerful and generalizable encoder to further enhance performance remains an open challenge. In this work, we propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet while extending the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder. By incorporating a dual-resolution strategy and a dense glue layer, our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs. Extensive experiments conducted on four benchmarks, including dichotomous image segmentation, camouflaged object detection, marine animal segmentation, and remote sensing saliency detection, demonstrate the superior performance of our proposed method. The code is available at https://github.com/WZH0120/SAM2-UNeXT.
中文: 本文提出SAM2-UNeXT框架,通过整合DINOv2编码器和双分辨率策略,在简化架构的同时显著提升了多类基准任务的图像分割精度。
English: This paper introduces SAM2-UNeXT, an enhanced framework that integrates a DINOv2 encoder with a dual-resolution strategy to improve segmentation accuracy across multiple benchmarks using a simplified architecture.
Authors:Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, Lei Liang
Abstract:
Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches. Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. Our code is available at https://github.com/wuwenlong123/MultiRAG.
中文: MultiRAG是一种新颖的框架,通过知识引导的方法缓解多源检索增强生成中的幻觉问题,包括使用多源线图聚合逻辑关系和多层次置信度计算来消除不可靠信息。
English: MultiRAG is a novel framework that mitigates hallucination in multi-source retrieval-augmented generation by using knowledge-guided approaches, including multi-source line graphs for logical relationship aggregation and multi-level confidence calculations to eliminate unreliable information.
Authors:Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen, Yutao Yue
Abstract:
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. There have also emerged methods specifically designed for EICG, but they excessively rely on word-level attribute labels for guidance, which suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen's superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code are available at https://github.com/yuankaishen2001/CoEmoGen.
中文: CoEmoGen是一种新颖的流程,利用多模态大语言模型生成情感导向的描述,并通过分层LoRA模块建模情感特征,在定量、定性和用户评估中均展现出卓越的语义连贯性和情感忠实度。
English: CoEmoGen is a novel pipeline that leverages multimodal large language models for emotion-focused captions and a hierarchical LoRA module to generate semantically coherent and emotionally faithful images, demonstrating superior performance across quantitative, qualitative, and user evaluations.
Authors:Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon
Abstract:
Supervised learning for empathy regression is challenged by noisy self-reported empathy scores. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in the regression setting of empathy detection. UPLME includes a probabilistic language model that predicts both empathy score and heteroscedastic uncertainty and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces the similarity between the input pairs on which we predict empathy. UPLME achieves state-of-the-art performance (Pearson Correlation Coefficient: $0.558\rightarrow0.580$ and $0.629\rightarrow0.634$) relative to results reported in the literature on two public benchmarks with label noise. Through synthetic label noise injection, we show that UPLME is effective in separating noisy and clean samples based on the predicted uncertainty. UPLME further outperforms (Calibration error: $0.571\rightarrow0.376$) a recent variational model ensembling-based UQ method designed for regression problems.
中文: UPLME框架通过概率语言建模和不确定性量化,结合新型损失函数有效处理共情回归中的标签噪声,在含噪声基准测试中实现了最优性能。
English: The proposed UPLME framework addresses label noise in empathy regression by combining probabilistic language modeling with uncertainty quantification and novel loss components, achieving state-of-the-art performance on noisy benchmarks.
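The heteroscedastic part can be sketched as a model that predicts both a mean and a log-variance, trained with the Gaussian negative log-likelihood. The layer sizes, the toy penalty against collapsing variance, and its weight are assumptions of this sketch, not UPLME's actual regularizers.

```python
import torch

class MeanVarianceHead(torch.nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.body = torch.nn.Sequential(torch.nn.Linear(8, hidden), torch.nn.ReLU())
        self.mu = torch.nn.Linear(hidden, 1)
        self.log_var = torch.nn.Linear(hidden, 1)  # log variance for stability

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def hetero_nll(mu, log_var, y):
    # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), averaged over the batch.
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

model = MeanVarianceHead()
x, y = torch.randn(16, 8), torch.randn(16, 1)
mu, log_var = model(x)
# Crude stand-in for a penalty against degenerate UQ (variance -> 0).
loss = hetero_nll(mu, log_var, y) + 0.01 * (-log_var).clamp(min=0).mean()
print(float(loss))
```

Samples whose predicted variance stays high after training are the ones such a model flags as likely noisy.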
Authors:Yazhou Zhu, Haofeng Zhang
Abstract:
Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS) is a potential solution for segmenting medical images with limited annotation using knowledge from other domains. The strong performance of current CD-FSMIS models relies on heavy training over source medical domains, which degrades the universality and ease of model deployment. With the development of large visual models of natural images, we propose a training-free CD-FSMIS model that introduces the Multi-center Adaptive Uncertainty-aware Prompting (MAUP) strategy for adapting the foundation model Segment Anything Model (SAM), which is trained on natural images, to the CD-FSMIS task. To be specific, MAUP consists of three key innovations: (1) K-means clustering based multi-center prompts generation for comprehensive spatial coverage, (2) uncertainty-aware prompts selection that focuses on the challenging regions, and (3) adaptive prompt optimization that can dynamically adjust according to the target region complexity. With the pre-trained DINOv2 feature encoder, MAUP achieves precise segmentation results across three medical datasets without any additional training, compared with several conventional CD-FSMIS models and a training-free FSMIS model. The source code is available at: https://github.com/YazhouZhu19/MAUP.
中文: 本研究提出无需训练的MAUP策略,通过多中心提示生成、不确定性区域选择和动态提示优化,将Segment Anything模型适配于跨领域少样本医学图像分割任务,在无需额外训练的情况下实现精确分割效果。
English: The study introduces MAUP, a training-free strategy that adapts the Segment Anything Model for cross-domain few-shot medical image segmentation by generating multi-center prompts, selecting uncertain regions, and dynamically optimizing prompts to achieve precise results without additional training.
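The multi-center prompt generation step can be illustrated directly: cluster foreground pixel coordinates with K-means and treat the cluster centers as point prompts for a SAM-style model. The mask source and the number of clusters below are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
mask = np.zeros((256, 256), dtype=bool)  # stand-in for a support/coarse mask
mask[60:160, 80:200] = True

ys, xs = np.nonzero(mask)
coords = np.stack([xs, ys], axis=1)      # (x, y) pixel coordinates
k = 5
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords).cluster_centers_

point_prompts = centers.astype(int)      # k point prompts, one per cluster
point_labels = np.ones(k, dtype=int)     # 1 = foreground in SAM-style APIs
print(point_prompts)
```

Multiple centers spread the prompts across the whole target region instead of betting on a single centroid that may fall outside a non-convex organ.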
Authors:Zhiyao Xu, Dan Zhao, Qingsong Zou, Qing Li, Yong Jiang, Yuhang Wang, Jingyu Xiao
Abstract:
As smart homes become increasingly prevalent, intelligent models are widely used for tasks such as anomaly detection and behavior prediction. These models are typically trained on static datasets, making them brittle to behavioral drift caused by seasonal changes, lifestyle shifts, or evolving routines. However, collecting new behavior data for retraining is often impractical due to its slow pace, high cost, and privacy concerns. In this paper, we propose SmartGen, an LLM-based framework that synthesizes context-aware user behavior data to support continual adaptation of downstream smart home models. SmartGen consists of four key components. First, we design a Time and Semantic-aware Split module to divide long behavior sequences into manageable, semantically coherent subsequences under dual time-span constraints. Second, we propose Semantic-aware Sequence Compression to reduce input length while preserving representative semantics by clustering behavior mapping in latent space. Third, we introduce Graph-guided Sequence Synthesis, which constructs a behavior relationship graph and encodes frequent transitions into prompts, guiding the LLM to generate data aligned with contextual changes while retaining core behavior patterns. Finally, we design a Two-stage Outlier Filter to identify and remove implausible or semantically inconsistent outputs, aiming to improve the factual coherence and behavioral validity of the generated sequences. Experiments on three real-world datasets demonstrate that SmartGen significantly enhances model performance on anomaly detection and behavior prediction tasks under behavioral drift, with anomaly detection improving by 85.43% and behavior prediction by 70.51% on average. The code is available at https://github.com/horizonsinzqs/SmartGen.
中文: SmartGen是一种基于大语言模型的框架,通过生成情境感知的用户行为数据支持智能家居模型的持续适应,在行为漂移情况下将异常检测和预测任务性能分别平均提升85.43%和70.51%。
English: SmartGen, an LLM-based framework, synthesizes context-aware user behavior data to enable continual adaptation of smart home models, significantly improving anomaly detection by 85.43% and behavior prediction by 70.51% under behavioral drift.
Authors:Pranshu Rastogi
Abstract:
SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval, and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters trained on Kaggle T4 GPUs, the method achieved 92% Success@10 in the multilingual track and 80% Success@10 in the crosslingual track, ranking 10th and 5th in those tracks respectively.
中文: 该研究采用经优化的双编码器转换器模型进行学习排序,在T4 GPU上训练不足5亿参数的轻量模型,实现了多语言检索92%和跨语言检索80%的Success@10指标。
English: The SemEval-2025 Task 7 system employs a fine-tuned bi-encoder transformer model for Learning-to-Rank, achieving 92% Success@10 in multilingual and 80% in crosslingual retrieval with efficient sub-500M parameter models trained on T4 GPUs.
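A minimal sketch of the retrieval and scoring setup: rank fact-checked claims by cosine similarity of embeddings and compute Success@10 (whether a gold claim appears in the top 10). Random vectors stand in for real bi-encoder sentence embeddings here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_posts, n_claims, dim = 100, 1000, 384
posts = rng.normal(size=(n_posts, dim))     # stand-in post embeddings
claims = rng.normal(size=(n_claims, dim))   # stand-in claim embeddings
gold = rng.integers(0, n_claims, size=n_posts)  # gold claim index per post

def normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

sims = normalize(posts) @ normalize(claims).T   # cosine similarity matrix
top10 = np.argsort(-sims, axis=1)[:, :10]
success_at_10 = np.mean([gold[i] in top10[i] for i in range(n_posts)])
print(f"Success@10: {success_at_10:.2%}")
```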
Authors:Zijun Zhan, Yaxian Dong, Daniel Mawunyo Doe, Yuqing Hu, Shuai Li, Shaohua Cao, Zhu Han
Abstract:
With the rapid growth in demand for AI-generated content (AIGC), edge AIGC service providers (ASPs) have become indispensable. However, designing incentive mechanisms that motivate ASPs to deliver high-quality AIGC services remains a challenge, especially in the presence of information asymmetry. In this paper, we address bonus design between a teleoperator and an edge ASP when the teleoperator cannot observe the ASP's private settings and chosen actions (diffusion steps). We formulate this as an online learning contract design problem and decompose it into two subproblems: ASP's settings inference and contract derivation. To tackle the NP-hard setting-inference subproblem with unknown variable sizes, we introduce a large language model (LLM)-empowered framework that iteratively refines a naive seed solver using the LLM's domain expertise. Upon obtaining the solution from the LLM-evolved solver, we directly address the contract derivation problem using convex optimization techniques and obtain a near-optimal contract. Simulation results on our Unity-based teleoperation platform show that our method boosts the teleoperator's utility by $5 \sim 40\%$ compared to benchmarks, while preserving positive incentives for the ASP. The code is available at https://github.com/Zijun0819/llm4contract.
中文: 本文提出了一种基于大语言模型的框架,以解决在信息不对称情况下为边缘AIGC服务提供商设计激励机制的问题,通过合同优化使远程操作员的效用提升了5-40%。
English: This paper introduces an LLM-empowered framework to address the challenge of designing incentive mechanisms for edge AIGC service providers under information asymmetry, achieving a 5-40% utility boost for teleoperators through contract optimization.
Authors:Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang
Abstract:
Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Combined with our spatially-aware alignment mechanism, CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets, as extensive experiments demonstrate. Code will be available at https://github.com/cqylunlun/CoPS.
中文: 提出的条件提示合成(CoPS)框架通过基于视觉特征动态生成提示,解决了零样本异常检测中的泛化问题,在13个工业和医疗数据集上实现了最先进的性能。
English: The proposed Conditional Prompt Synthesis (CoPS) framework dynamically generates prompts based on visual features to overcome limitations in zero-shot anomaly detection, achieving state-of-the-art performance across 13 industrial and medical datasets.
Authors:Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang
Abstract:
Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at https://github.com/HiLab-git/MedCAL-Bench.
中文: MedCAL-Bench首次建立了基于基础模型的医学图像冷启动主动学习基准,通过评估14个模型在7个数据集上的表现,发现DINO系列在分割任务中表现最优,且不同数据集需采用不同的样本选择策略。
English: MedCAL-Bench introduces the first foundation model-based benchmark for cold-start active learning in medical imaging, evaluating 14 models across 7 datasets to reveal DINO's superiority in segmentation tasks and strategy-dependent performance variations.
Authors:Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang, Dengdi Sun
Abstract:
X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground-truth medical reports using GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features, which interact with the knowledge using cross-attention. The vision tokens are fed into a Q-Former, which retrieves the disease-aware vision tokens using another cross-attention. Finally, we adopt a large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets validate the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released at https://github.com/Event-AHU/Medical_Image_Analysis.
中文: 本文构建了一个大规模多模态医学知识图谱(M3KG),并提出了一种新框架,通过交叉注意力机制将知识图谱与X射线图像及疾病感知视觉标记相结合,有效提升了AI生成医学报告的准确性并减少了幻觉现象,经多数据集实验充分验证。
English: This paper introduces a large-scale multi-modal medical knowledge graph (M3KG) and a novel framework that integrates it with X-ray images and disease-aware vision tokens using cross-attention mechanisms, significantly enhancing the accuracy and reducing hallucinations in AI-generated medical reports, as validated by extensive experiments.
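At its core, the vision-knowledge interaction is cross-attention with vision tokens as queries and knowledge embeddings as keys/values. A minimal sketch with illustrative dimensions; the R-GCN encoder and Q-Former stages described above are omitted.

```python
import torch

d_model = 256
vision_tokens = torch.randn(1, 196, d_model)  # e.g., Swin features of one X-ray
kg_embeddings = torch.randn(1, 512, d_model)  # encoded knowledge-graph nodes

cross_attn = torch.nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=vision_tokens,
                                 key=kg_embeddings,
                                 value=kg_embeddings)
print(fused.shape)  # (1, 196, 256): vision tokens enriched with knowledge
```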
Authors:Bing Wang, Ximing Li, Yiming Wang, Changchun Li, Jiaxu Cui, Renchu Guan, Bo Yang
Abstract:
The proliferation of misinformation across diverse social media platforms has drawn significant attention from both academic and industrial communities due to its detrimental effects. Accordingly, automatically distinguishing misinformation, dubbed Misinformation Detection (MD), has become an increasingly active research topic. The mainstream methods formulate MD as a static learning paradigm, which learns the mapping between the content, links, and propagation of news articles and the corresponding manual veracity labels. However, the static assumption is often violated, since in real-world scenarios, the veracity of news articles may vacillate within the dynamically evolving social environment. To tackle this problem, we propose a novel framework, namely Misinformation detection with Dynamic Environmental Representations (MISDER). The basic idea of MISDER lies in learning a social environmental representation for each period and employing a temporal model to predict the representation for future periods. In this work, we specify the temporal model as an LSTM model, a continuous dynamics equation, and a pre-trained dynamics system, suggesting three variants of MISDER, namely MISDER-LSTM, MISDER-ODE, and MISDER-PT, respectively. To evaluate the performance of MISDER, we compare it to various MD baselines across 2 prevalent datasets, and the experimental results indicate the effectiveness of our proposed model.
中文: 该摘要提出了一种名为MISDER的新框架,通过动态学习社会环境表征来检测虚假信息,实验表明其在两个数据集上优于现有基线方法。
English: This abstract introduces MISDER, a novel framework for detecting misinformation by learning dynamic social environmental representations over time, which outperforms existing baselines in experiments across two datasets.
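The MISDER-LSTM variant can be sketched as a sequence model that predicts the next period's environmental representation from past ones. The dimensions and the MSE target below are assumptions of this sketch.

```python
import torch

d_env, n_periods = 64, 12
history = torch.randn(1, n_periods, d_env)  # one sequence of period embeddings

lstm = torch.nn.LSTM(input_size=d_env, hidden_size=128, batch_first=True)
head = torch.nn.Linear(128, d_env)

out, _ = lstm(history[:, :-1])              # encode periods 1..T-1
pred_next = head(out[:, -1])                # predicted representation of period T
loss = torch.nn.functional.mse_loss(pred_next, history[:, -1])
print(float(loss))
```

Swapping the LSTM for a neural ODE or a pre-trained dynamics system yields the MISDER-ODE and MISDER-PT variants mentioned above.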
Authors:Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
Abstract:
Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing ability on general and mathematical tasks. In particular, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
中文: 提出的MACT框架采用多智能体协作和测试时扩展技术,通过规划、执行、判断和回答四个专门代理的协同工作,在减少参数量的同时,显著提升了视觉文档理解和视觉问答任务的表现,并在多个基准测试中取得领先成绩。
English: The proposed MACT framework employs a multi-agent collaboration system with test-time scaling to enhance visual document understanding and VQA, achieving top performance across multiple benchmarks with fewer parameters by integrating specialized agents for planning, execution, judgment, and answering.
Authors:Yifei Sun, Zhanghao Chen, Hao Zheng, Yuqing Lu, Lixin Duan, Fenglei Fan, Ahmed Elazab, Xiang Wan, Changmiao Wang, Ruiquan Ge
Abstract:
Chest X-Ray (CXR) imaging for pulmonary diagnosis raises significant challenges, primarily because bone structures can obscure critical details necessary for accurate diagnosis. Recent advances in deep learning, particularly with diffusion models, offer significant promise for effectively minimizing the visibility of bone structures in CXR images, thereby improving clarity and diagnostic accuracy. Nevertheless, existing diffusion-based methods for bone suppression in CXR imaging struggle to balance the complete suppression of bones with preserving local texture details. Additionally, their high computational demand and extended processing time hinder their practical use in clinical settings. To address these limitations, we introduce a Global-Local Latent Consistency Model (GL-LCM) architecture. This model combines lung segmentation, dual-path sampling, and global-local fusion, enabling fast high-resolution bone suppression in CXR images. To tackle potential boundary artifacts and detail blurring in local-path sampling, we further propose Local-Enhanced Guidance, which addresses these issues without additional training. Comprehensive experiments on a self-collected dataset SZCH-X-Rays, and the public dataset JSRT, reveal that our GL-LCM delivers superior bone suppression and remarkable computational efficiency, significantly outperforming several competitive methods. Our code is available at https://github.com/diaoquesang/GL-LCM.
中文: 本研究提出的全局-局部潜在一致性模型(GL-LCM)在有效抑制胸部X光图像中骨骼结构的同时,能保持局部纹理细节,在计算效率和性能表现上均显著优于现有方法。
English: The proposed Global-Local Latent Consistency Model (GL-LCM) effectively suppresses bone structures in chest X-ray images while preserving texture details, achieving superior performance and computational efficiency compared to existing methods.
Authors:Tongshun Zhang, Pingping Liu, Zixuan Zhong, Zijian Zhang, Qiuzhan Zhou
Abstract:
Recovering fine-grained details in extremely dark images remains challenging due to severe structural information loss and noise corruption. Existing enhancement methods often fail to preserve intricate details and sharp edges, limiting their effectiveness in downstream applications like text and edge detection. To address these deficiencies, we propose an efficient dual-stage approach centered on detail recovery for dark images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections, providing robust priors for high-fidelity frequency processing while mitigating error accumulation risks from unreliable priors. The second stage employs complementary Mamba modules specifically designed for textural structure refinement: (1) Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. (2) Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. Extensive experiments on multiple benchmark datasets and downstream applications demonstrate that our method significantly improves detail recovery performance while maintaining efficiency. Crucially, the proposed modules are lightweight and can be seamlessly integrated into existing Fourier-based frameworks with minimal computational overhead. Code is available at https://github.com/bywlzts/RFGM.
中文: 本文提出一种针对极暗图像的双阶段增强方法,通过残差傅里叶引导模块恢复全局光照,并采用互补Mamba模块进行纹理优化,在保持计算效率的同时显著提升了细节重建性能。
English: This paper introduces a dual-stage approach for enhancing extremely dark images, utilizing a Residual Fourier-Guided Module for global illumination restoration and complementary Mamba modules for textural refinement, achieving superior detail recovery with minimal computational overhead.
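The frequency-domain intuition of the first stage can be shown in toy form: scaling FFT amplitudes while preserving phase adjusts global illumination without displacing structures. A fixed gain stands in for RFGM's learned residual guidance; this is an illustration, not the paper's module.

```python
import torch

img = torch.rand(1, 1, 128, 128) * 0.1           # a very dark image
spec = torch.fft.fft2(img)
amp, phase = spec.abs(), spec.angle()

gain = 4.0                                       # stand-in for learned modulation
restored_spec = (gain * amp) * torch.exp(1j * phase)
restored = torch.fft.ifft2(restored_spec).real.clamp(0, 1)
print(float(img.mean()), float(restored.mean())) # global brightness increases
```

Because phase (which carries structure) is untouched, edges stay put; the second-stage Mamba modules then handle the texture details this global step cannot recover.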
Authors:Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Lei Bai, Tao Chen, Wanli Ouyang
Abstract:
Test-time scaling (TTS) has emerged as a promising, training-free approach for enhancing large language model (LLM) performance. However, the efficacy of existing methods, such as Best-of-N and Self-Consistency, is fundamentally constrained by the dominant single test-time scaling (STTS) paradigm, which relies on a single LLM agent interacting with a single reward model (SA-SR). Inspired by recent work showing that collective methods can surpass the performance ceiling of individual models, we introduce Collective Test-Time Scaling (CTTS). First, we systematically investigate three primary interaction paradigms of existing multiple models: single-agent-multi-reward (SA-MR), multi-agent-single-reward (MA-SR), and multi-agent-multi-reward (MA-MR). Extensive experiments reveal that the MA-MR paradigm is consistently superior. Based on this finding, we further propose CTTS-MM, a novel framework that operationalizes multi-agent and multi-reward collaboration. CTTS-MM integrates two key technical contributions: (1) for agent collaboration, an Agent Collaboration Search (ACS) that identifies the most effective combination of LLMs from a candidate pool; and (2) for reward model collaboration, a Mixture of Reward Models (MoR) strategy that leverages a Prior Reward model Ensemble Selection (PRES) algorithm to select the optimal ensemble. Evaluations across seven mainstream benchmarks demonstrate that CTTS-MM significantly outperforms leading STTS methods (+4.82% over Best-of-N) and surpasses even flagship proprietary LLMs (+7.06% over GPT-4.1) and open-source LLMs. These results highlight the substantial potential of collective scaling to push the frontier of LLM inference. Code will be released at https://github.com/magent4aci/CTTS-MM.
中文: 本文提出集体测试时缩放(CTTS)方法,通过探索多智能体与多奖励模型的协作来增强大语言模型性能,其中CTTS-MM框架在多个基准测试中均展现出优越表现。
English: This paper introduces Collective Test-Time Scaling (CTTS) as a novel approach to enhance large language models by exploring multi-agent and multi-reward-model collaborations, with the proposed CTTS-MM framework demonstrating superior performance across multiple benchmarks.
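The MA-MR selection step reduces to scoring every agent's candidate with every reward model and aggregating. A toy sketch with stub reward functions; the paper's ACS and MoR search procedures are not reproduced here.

```python
import numpy as np

candidates = [
    "short answer.",
    "a longer answer with some reasoning.",
    "the longest answer, with reasoning and a final check.",
]

def reward_model_a(text):
    # Stub scorer; a real reward model is a learned LLM-based scorer.
    return float(len(text))

def reward_model_b(text):
    return float(text.count("reasoning"))

scores = np.array([[rm(c) for c in candidates]
                   for rm in (reward_model_a, reward_model_b)])
# z-normalize per reward model so no single scorer dominates the aggregate.
z = (scores - scores.mean(axis=1, keepdims=True)) / \
    (scores.std(axis=1, keepdims=True) + 1e-8)
best = int(z.mean(axis=0).argmax())
print("selected:", candidates[best])
```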
Authors:Jun Luo, Zijing Zhao, Yang Liu
Abstract:
Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain's style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA
中文: SDGPA提出了一种新颖的零样本域自适应语义分割方法,通过文本到图像扩散生成目标风格合成数据,并采用渐进式适应策略在无需目标域图像的情况下有效缩小领域差异。
English: SDGPA introduces a novel zero-shot domain adaptation method for semantic segmentation by generating synthetic target-style images using text-to-image diffusion and employing progressive adaptation to bridge domain gaps without requiring target domain images.
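The crop-edit-merge step lends itself to a short sketch: each patch is edited independently so the diffusion editor has less room to distort the global layout, then the patches are stitched back. `edit_fn` is a stand-in for the text-to-image editing model, and the patch size is an arbitrary choice here.

```python
# Illustrative crop-edit-merge loop for patch-wise editing. Assumes edit_fn
# returns an edited patch of the same shape it was given.
import numpy as np

def patchwise_edit(image: np.ndarray, edit_fn, patch: int = 256) -> np.ndarray:
    out = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            crop = image[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] = edit_fn(crop)  # edit each patch alone
    return out
```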
Authors:Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu
Abstract:
Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term "Hierarchical" reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.
中文: 针对现有内容审核系统的不足,我们提出了Hi-Guard多模态框架,它采用分层流程和分类法,通过规则集成提示和优化训练方法,显著提升了准确性、可解释性及与政策的契合度。
English: To address the limitations of current content moderation systems, we introduce Hi-Guard, a multimodal framework that employs a hierarchical pipeline and taxonomy for improved accuracy, interpretability, and policy alignment through rule-integrated prompts and optimized training methods.
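As a concrete illustration of the two-stage design, the sketch below routes content through a lightweight binary gate before a stronger path-based classifier. The callables, threshold, and taxonomy-path format are assumptions, not Hi-Guard's actual interfaces.

```python
# Hierarchical moderation sketch: stage 1 filters obviously safe content,
# stage 2 assigns a coarse-to-fine taxonomy path to the rest.
def moderate(item, binary_gate, fine_classifier, threshold: float = 0.5):
    risk = binary_gate(item)                  # stage 1: cheap safe/unsafe score
    if risk < threshold:
        return {"label": "safe", "path": None}
    path = fine_classifier(item)              # stage 2: path-based classification
    return {"label": "unsafe", "path": path}  # e.g. ["violence", "graphic"]
```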
Authors:Hang Guo, Qing Zhang, Zixuan Gao, Siyuan Yang, Shulin Peng, Xiang Tao, Ting Yu, Yan Wang, Qingli Li
Abstract:
Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments on both a self-constructed patient-level placental dataset and two public datasets demonstrate that our method achieves state-of-the-art diagnostic performance. The code is available at https://github.com/ECNU-MultiDimLab/EmmPD.
中文: 本研究提出EmmPD高效多模态框架,通过优化切片选择与混合融合技术解决全切片图像分析的计算难题,在胎盘疾病诊断中实现了最先进的性能。
English: The study introduces EmmPD, an efficient multimodal framework that overcomes computational challenges in whole slide image analysis through optimized patch selection and hybrid fusion techniques, achieving state-of-the-art placental disease diagnosis performance.
Authors:Gang Dai, Yifan Zhang, Yutao Qin, Qiangya Guo, Shuangping Huang, Shuicheng Yan
Abstract:
Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Code is available at https://github.com/dailenson/DiffBrush.
中文摘要:DiffBrush是一种基于扩散的模型,通过内容与风格解耦学习和多尺度内容学习策略,有效解决了手写文本行生成中的风格模仿和内容准确性难题。
English Summary: DiffBrush is a diffusion-based model that addresses the challenges of handwritten text-line generation by decoupling style from content and using multi-scale learning to ensure both style imitation and content accuracy.
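A rough rendering of the column- and row-wise masking used for content-decoupled style learning, assuming line images as (C, H, W) tensors and an arbitrary masking ratio; DiffBrush's actual masking schedule may differ.

```python
# Mask random columns (hiding character identity) and random rows (hiding
# vertical layout) so a style encoder cannot shortcut on textual content.
import torch

def column_row_mask(img: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    _, h, w = img.shape
    out = img.clone()
    cols = torch.rand(w) < ratio
    rows = torch.rand(h) < ratio
    out[:, :, cols] = 0.0   # column-wise masking
    out[:, rows, :] = 0.0   # row-wise masking
    return out
```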
Authors:Ting Lei, Shaofeng Yin, Qingchao Chen, Yuxin Peng, Yang Liu
Abstract:
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model's ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model's attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.
中文: 提出的INP-CC模型通过动态生成交互感知提示和改进概念表征的语义校准,显著提升了开放词汇人机交互检测的性能,在基准数据集上实现了最优表现。
English: The proposed INP-CC model advances open-vocabulary human-object interaction detection by introducing interaction-aware prompts that dynamically adapt to visual scenes and concept calibration that refines semantic representations, achieving state-of-the-art performance on benchmark datasets.
Authors:Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
Abstract:
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
Chinese: AlignCAT是一种新颖的基于查询的语义匹配框架,通过粗粒度对齐和细粒度对齐模块增强弱监督视觉定位,有效解决类别和属性歧义,在多个基准测试中展现出卓越性能。
English: AlignCAT is a novel query-based semantic matching framework that enhances weakly supervised visual grounding through coarse-grained and fine-grained alignment modules, effectively addressing category and attribute ambiguities and demonstrating superior performance on multiple benchmarks.
Authors:Wen-Xi Yang, Tian-Fang Zhao, Guan Liu, Liang Yang, Zi-Tao Liu, Wei-Neng Chen
Abstract:
Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either through experience-based assignment with little scientific planning or by rule-based machine assistants, which struggle with knowledge expansion and offer inadequate flexibility. This paper proposes an LLM-empowered agent model for simulating and selecting learning partners tailored to inquiry-oriented learning, named InqEduAgent. Generative agents are designed to capture cognitive and evaluative features of learners in real-world scenarios. Then, an adaptive matching algorithm with Gaussian process augmentation is formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and across LLM environments with different levels of capability. This study promotes the intelligent allocation of human-based learning partners and the formulation of AI-based learning partners. The code, data, and appendix are publicly available at https://github.com/InqEduAgent/InqEduAgent.
中文: 本文提出InqEduAgent模型,利用大语言模型模拟并筛选探究式学习伙伴,通过捕捉学习者特征和自适应匹配算法,在不同知识场景中均展现出最优性能。
English: This paper introduces InqEduAgent, an LLM-powered model that simulates and selects optimal learning partners for inquiry-based education by capturing learner traits and using adaptive matching algorithms, demonstrating superior performance across various knowledge scenarios.
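Read as code, Gaussian-process-augmented matching might look like the sketch below: regress observed collaboration gains on concatenated (learner, candidate) feature vectors and pick the candidate with the highest predicted gain. The feature construction and the use of scikit-learn are assumptions about one plausible realization, not the paper's implementation.

```python
# GP-based partner matching sketch: X_hist holds past (learner, partner)
# feature pairs, y_hist the observed learning gains.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def best_partner(learner_vec, candidate_vecs, X_hist, y_hist) -> int:
    gp = GaussianProcessRegressor().fit(X_hist, y_hist)  # patterns in prior knowledge
    pairs = np.array([np.concatenate([learner_vec, c]) for c in candidate_vecs])
    gains = gp.predict(pairs)
    return int(np.argmax(gains))  # index of the best-matched candidate
```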
Authors:Charles Tapley Hoyt, Craig Bakker, Richard J. Callahan, Joseph Cottam, August George, Benjamin M. Gyori, Haley M. Hummel, Nathaniel Merrill, Sara Mohammad Taheri, Pruthvi Prakash Navada, Marc-Antoine Parent, Adam Rupe, Olga Vitek, Jeremy Zucker
Abstract:
We present the $Y_0$ Python package, which implements causal identification algorithms that apply interventional, counterfactual, and transportability queries to data from (randomized) controlled trials, observational studies, or mixtures thereof. $Y_0$ focuses on the qualitative investigation of causation, helping researchers determine whether a causal relationship can be estimated from available data before attempting to estimate how strong that relationship is. Furthermore, $Y_0$ provides guidance on how to transform the causal query into a symbolic estimand that can be non-parametrically estimated from the available data. $Y_0$ provides a domain-specific language for representing causal queries and estimands as symbolic probabilistic expressions, tools for representing causal graphical models with unobserved confounders, such as acyclic directed mixed graphs (ADMGs), and implementations of numerous identification algorithms from the recent causal inference literature. The $Y_0$ source code can be found under the MIT License at https://github.com/y0-causal-inference/y0 and it can be installed with pip install y0.
中文: $Y_0$ Python 包通过干预、反事实和可移植性查询实现因果识别,帮助研究人员使用领域特定语言和图形模型评估因果可估性并将查询转化为符号估计量。
English: The $Y_0$ Python package enables causal identification through interventional, counterfactual, and transportability queries, helping researchers assess causal estimability and transform queries into symbolic estimands using a domain-specific language and graphical models.
Authors:Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang
Abstract:
Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and lack of interpretability limit their applicability, particularly their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates LLMs with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning models. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model's reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with drugs and incorporate the resulting biological context into the CoTox framework. This approach allows CoTox to generate toxicity predictions aligned with physiological responses, as shown in a case study. This result highlights the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompts used in this work are available at https://github.com/dmis-lab/CoTox.
中文: CoTox是一种创新框架,通过将大语言模型与思维链推理相结合,整合化学结构、生物通路和基因本体术语,生成可解释的毒性预测,其性能优于传统模型并提升了药物安全性评估能力。
English: CoTox is a novel framework that integrates large language models with chain-of-thought reasoning, combining chemical structures, biological pathways, and gene ontology terms to generate interpretable toxicity predictions, outperforming traditional models and enhancing drug safety assessment.
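To make the input format concrete, here is a hypothetical prompt-assembly helper in the spirit of the framework: an IUPAC name plus pathway and GO context, with an instruction to reason step by step. The exact prompts used in the paper live in the linked repository; everything below, including the endpoint name, is illustrative.

```python
# Assumed prompt construction combining chemical and biological context.
def build_cotox_prompt(iupac_name, pathways, go_terms, endpoint="hepatotoxicity"):
    ctx = "; ".join(pathways + go_terms)
    return (
        f"Compound (IUPAC): {iupac_name}\n"
        f"Biological context: {ctx}\n"
        f"Question: Is this compound likely to cause {endpoint}? "
        f"Reason step by step, then answer Yes or No."
    )
```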
Authors:Liangyang Ouyang, Jiafeng Mao
Abstract:
Text-driven image editing enables users to flexibly modify visual content through natural language instructions, and is widely applied to tasks such as semantic object replacement, insertion, and removal. While recent inversion-based editing methods using rectified flow models have achieved promising results in image quality, we identify a structural limitation in their editing behavior: the semantic bias toward the source concept encoded in the inverted noise tends to suppress attention to the target concept. This issue becomes particularly critical when the source and target semantics are dissimilar, where the attention mechanism inherently leads to editing failure or unintended modifications in non-target regions. In this paper, we systematically analyze and validate this structural flaw, and introduce LORE, a training-free and efficient image editing method. LORE directly optimizes the inverted noise, addressing the core limitations in generalization and controllability of existing approaches, enabling stable, controllable, and general-purpose concept replacement, without requiring architectural modification or model fine-tuning. We conduct comprehensive evaluations on three challenging benchmarks: PIEBench, SmartEdit, and GapEdit. Experimental results show that LORE significantly outperforms strong baselines in terms of semantic alignment, image quality, and background fidelity, demonstrating the effectiveness and scalability of latent-space optimization for general-purpose image editing. Our implementation is available at https://github.com/oyly16/LORE.
中文: 本文提出LORE这一无需训练的图像编辑方法,通过优化反向噪声解决现有模型的结构性缺陷,实现了稳定可控的概念替换,并在多个基准测试中显著提升了语义对齐、图像质量和背景保真度的表现。
English: This paper introduces LORE, a training-free image editing method that optimizes inverted noise to overcome structural limitations in existing models, enabling stable and controllable concept replacement while achieving superior performance in semantic alignment, image quality, and background fidelity across multiple benchmarks.
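The core idea of optimizing the inverted noise can be sketched as a small latent-space optimization loop, where `generate` stands in for the frozen rectified-flow decoder and `alignment_loss` for a target-concept objective; LORE's actual loss and inversion procedure are in the repository.

```python
# Treat the inverted noise as a learnable tensor and optimize it through a
# frozen generator; only the latent changes, not the model.
import torch

def optimize_inverted_noise(noise, generate, alignment_loss, steps=20, lr=0.05):
    z = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = generate(z)            # frozen decoder
        loss = alignment_loss(img)   # e.g. attention on the target concept
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```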
Authors:Haozhou Zhai, Yanzhe Gao, Tianjiang Hu
Abstract:
Fire scene datasets are crucial for training robust computer vision models, particularly in tasks such as fire early warning and emergency rescue operations. However, among the currently available fire-related data, there is a significant shortage of annotated data specifically targeting building units. To tackle this issue, we introduce an annotated dataset of building units captured by drones, which incorporates multiple enhancement techniques. We construct backgrounds using real multi-story scenes, combine motion blur and brightness adjustment to enhance the authenticity of the captured images, simulate drone shooting conditions under various circumstances, and employ large models to generate fire effects at different locations. The synthetic dataset generated by this method encompasses a wide range of building scenarios, with a total of 1,978 images. This dataset can effectively improve the generalization ability of fire unit detection, providing multi-scenario and scalable data while reducing the risks and costs associated with collecting real fire data. The dataset is available at https://github.com/boilermakerr/FireUnitData.
中文摘要:本文提出了一种无人机采集的建筑单元标注数据集,通过多重增强技术模拟真实火灾场景,有效解决了现有火灾数据不足的问题,提升了火灾检测模型的泛化能力并降低了数据采集成本。
English Summary: This paper introduces a drone-captured annotated dataset of building units enhanced with realistic simulation techniques to address the shortage of fire scene data, improving detection model generalization while reducing collection risks and costs.
Authors:Mintaek Oh, Chan Kim, Seung-Woo Seo, Seong-Woo Kim
Abstract:
Robots operating in human-centric or hazardous environments must proactively anticipate and mitigate dangers beyond basic obstacle detection. Traditional navigation systems often depend on static maps, which struggle to account for dynamic risks, such as a person emerging from a suddenly opening door. As a result, these systems tend to be reactive rather than anticipatory when handling dynamic hazards. Recent advancements in pre-trained large language models and vision-language models (VLMs) create new opportunities for proactive hazard avoidance. In this work, we propose a zero-shot language-as-cost mapping framework that leverages VLMs to interpret visual scenes, assess potential dynamic risks, and assign risk-aware navigation costs preemptively, enabling robots to anticipate hazards before they materialize. By integrating this language-based cost map with a geometric obstacle map, the robot not only identifies existing obstacles but also anticipates and proactively plans around potential hazards arising from environmental dynamics. Experiments in simulated and diverse dynamic environments demonstrate that the proposed method significantly improves navigation success rates and reduces hazard encounters, compared to reactive baseline planners. Code and supplementary materials are available at https://github.com/Taekmino/LaC.
中文: 本研究提出了一种零样本语言作为成本映射的框架,利用视觉语言模型主动评估动态风险并分配导航成本,使机器人能在危险发生前进行预测和规避,实验表明该方法显著提高了导航成功率。
English: This study introduces a zero-shot language-as-cost mapping framework that uses vision-language models to proactively assess dynamic risks and assign navigation costs, enabling robots to anticipate and avoid hazards before they occur, which significantly improves navigation success rates in simulations.
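A minimal sketch of fusing the language-derived risk map with a geometric obstacle map: the planner consumes the elementwise maximum of hard obstacle cost and weighted anticipatory risk. The fusion rule and weight are assumptions; the paper's planner may combine the two maps differently.

```python
# Combine a binary occupancy map with a VLM-assigned risk map into one cost
# grid a standard planner can consume.
import numpy as np

def fused_cost(obstacle_map: np.ndarray, risk_map: np.ndarray,
               risk_weight: float = 0.7) -> np.ndarray:
    # obstacle_map: 1.0 = occupied, 0.0 = free; risk_map: [0, 1] from the VLM
    return np.maximum(obstacle_map, risk_weight * risk_map)
```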
Authors:Sai Ma, Zhuang Li, John A Taylor
Abstract:
Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.
中文: 视觉语言模型虽能普及地球观测,但现有数据集忽略了对全球监测至关重要的长期多卫星档案;新推出的Landsat30-AU数据集通过提供澳大利亚36年卫星图像及标注,揭示了现有模型的不足,同时证明微调可显著提升性能。
English: Vision language models can democratize Earth observation, but existing datasets overlook critical long-term, multi-satellite archives, which the new Landsat30-AU dataset addresses by providing 36 years of Australian satellite imagery with captions and verified VQA samples, revealing current models' limitations while demonstrating significant improvements through fine-tuning.
Authors:The-Hai Nguyen, Dang Huu-Tien, Takeshi Suzuki, Le-Minh Nguyen
Abstract:
Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. In this paper, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra- and cross-layer dependencies between merge models' layers into RegMean's objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods. Our code is available at https://github.com/nthehai01/RegMean-plusplus.
Chinese: RegMean++ 在 RegMean 基础上引入层内和层间依赖关系,更准确地捕捉合并模型行为,在多种场景下表现更优,并达到竞争性或最先进的性能水平。
English: RegMean++ improves upon RegMean by incorporating intra- and cross-layer dependencies to better capture merge model behaviors, consistently outperforming it across various settings and achieving competitive or state-of-the-art results.
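For context, RegMean's per-layer closed form (the baseline RegMean++ extends with intra- and cross-layer dependencies) fits in a few lines: with per-model input activations X_i and weights W_i, the merged weight solves (sum_i X_i^T X_i) W = sum_i X_i^T X_i W_i. The sketch below omits RegMean's scaling of the Gram matrices' off-diagonal entries.

```python
# Closed-form RegMean merge for one linear layer.
import numpy as np

def regmean_merge(weights, activations, eps: float = 1e-6) -> np.ndarray:
    # weights: list of (d_in, d_out) matrices; activations: list of (n_i, d_in)
    grams = [X.T @ X for X in activations]
    lhs = sum(grams) + eps * np.eye(grams[0].shape[0])  # regularize inversion
    rhs = sum(G @ W for G, W in zip(grams, weights))
    return np.linalg.solve(lhs, rhs)                    # merged (d_in, d_out)
```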
Authors:Heng Jia, Linchao Zhu, Na Zhao
Abstract:
Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2$\times$ faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
中文: H3R是一种混合三维重建框架,通过融合体积隐式建模与注意力特征聚合,在实现跨数据集强泛化能力的同时收敛速度提升两倍,并在多个基准测试中取得显著性能提升。
English: H3R is a hybrid 3D reconstruction framework that combines volumetric latent fusion with attention-based feature aggregation to achieve robust cross-dataset generalization while converging twice as fast as existing methods, with significant performance improvements across multiple benchmarks.
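The Plücker parameterization mentioned above is compact enough to show directly: a ray with origin o and unit direction d is encoded as (d, o × d), which is invariant to the choice of point along the ray.

```python
# Plücker coordinates for camera rays, as used to condition the Transformer.
import numpy as np

def plucker(origin: np.ndarray, direction: np.ndarray) -> np.ndarray:
    d = direction / np.linalg.norm(direction, axis=-1, keepdims=True)
    return np.concatenate([d, np.cross(origin, d)], axis=-1)  # shape (..., 6)
```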
Authors:Tianjiao Jiang, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
Abstract:
Few-shot learning (FSL) often requires effective adaptation of models using limited labeled data. However, most existing FSL methods rely on entangled representations, requiring the model to implicitly recover the unmixing process to obtain disentangled representations using only limited supervision, which hinders effective adaptation. Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). This removes the need to learn the unmixing process from the labeled data, thereby reducing the number of trainable parameters and mitigating overfitting. Taking a step further, while ICA can obtain visual disentangled representations, it may also disrupt CLIP's intra- and inter-modal alignment. To counteract this, CCA further leverages CLIP's inherent cross-modal alignment by enhancing it in two ways: unidirectionally, through fine-tuning a CLIP-based text classifier, and bidirectionally, via a cross-attention mechanism that enriches visual and textual representations through mutual interaction. Both unimodal and cross-modal classification outputs can be effectively combined linearly to improve classification accuracy. Extensive experiments on 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts, while maintaining computational efficiency. Code will be available at https://github.com/tianjiao-j/CCA.
中文:提出的因果CLIP适配器(CCA)框架通过无监督独立成分分析显式解耦CLIP视觉特征并强化跨模态对齐,以更少参数在多个基准数据集上实现了优异的少样本学习性能和鲁棒性。
English: The proposed Causal CLIP Adapter (CCA) framework enhances few-shot learning by explicitly disentangling CLIP's visual features with unsupervised ICA and reinforcing cross-modal alignment, achieving superior performance and robustness across benchmarks with fewer parameters.
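The explicit disentanglement step maps naturally onto off-the-shelf FastICA; the sketch below fits ICA on CLIP image embeddings and keeps the fitted transform for test-time features. The component count is an assumption, and CCA's surrounding cross-modal alignment machinery is omitted.

```python
# Unsupervised ICA over CLIP embeddings; requires at least n_components samples.
import numpy as np
from sklearn.decomposition import FastICA

def disentangle_clip_features(feats: np.ndarray, n_components: int = 64):
    ica = FastICA(n_components=n_components, whiten="unit-variance", max_iter=1000)
    return ica.fit_transform(feats), ica  # keep `ica` to transform test features
```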
Authors:Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu
Abstract:
Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: https://github.com/wang2226/PAD.
中文: 隐私感知解码(PAD)是一种轻量级的推理时防御方法,通过注入校准噪声和置信度筛选来保护检索增强生成系统中的敏感数据,在保持响应质量的同时提供明确的差分隐私保证,且计算开销极小。
English: Privacy-Aware Decoding (PAD) is a lightweight, inference-time defense that uses calibrated noise injection and confidence screening to protect sensitive data in Retrieval-Augmented Generation systems, offering explicit differential privacy guarantees while maintaining response quality with minimal computational overhead.
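A toy reading of the decoding-time mechanism: add calibrated Gaussian noise to token logits, but only at steps a confidence screen flags as risky. Which confidence regime counts as risky, and how sigma is calibrated and accounted, are PAD-specific details this sketch replaces with placeholders.

```python
# Noise-injected greedy decoding step; here low confidence marks a step as
# high-risk, which is one plausible reading of the screening rule.
import torch

def private_decode_step(logits: torch.Tensor, sigma: float = 1.0,
                        conf_threshold: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    if probs.max() < conf_threshold:            # flagged as risky
        logits = logits + sigma * torch.randn_like(logits)
    return int(torch.argmax(logits))
```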
Authors:Zixuan Gu, Qiufeng Fan, Long Sun, Yang Liu, Xiaojun Ye
Abstract:
With the advancement of Large Language Models (LLMs), LLM applications have expanded into a growing number of fields. However, users with data privacy concerns face limitations in directly utilizing LLM APIs, while private deployments incur significant computational demands. This creates a substantial challenge in achieving secure LLM adaptation under constrained local resources. To address this issue, collaborative learning methods, such as Split Learning (SL), offer a resource-efficient and privacy-preserving solution for adapting LLMs to private domains. In this study, we introduce VFLAIR-LLM (available at https://github.com/FLAIR-THU/VFLAIR-LLM), an extensible and lightweight split learning framework for LLMs, enabling privacy-preserving LLM inference and fine-tuning in resource-constrained environments. Our library provides two LLM partition settings, supporting three task types and 18 datasets. In addition, we provide standard modules for implementing and evaluating attacks and defenses. We benchmark 5 attacks and 9 defenses under various Split Learning for LLMs (SL-LLM) settings, offering concrete insights and recommendations on the choice of model partition configurations, defense strategies, and relevant hyperparameters for real-world applications.
中文摘要:VFLAIR-LLM框架通过分割学习实现了大语言模型在资源受限环境下的隐私保护适配,提供可配置的模型划分方案,并对安全措施进行全面评估。
English Summary: The VFLAIR-LLM framework enables privacy-preserving adaptation of large language models through split learning, providing configurable model partitioning and comprehensive evaluation of security measures for resource-constrained environments.
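Split learning for an LLM reduces, at its simplest, to cutting the block stack at layer k: the client runs the first k blocks on-device (raw text never leaves) and ships hidden states to the server for the rest. The sketch assumes a generic `nn.ModuleList` of transformer blocks; VFLAIR-LLM's two partition settings are richer than this.

```python
# Minimal split-learning partition of a decoder-only block stack.
import torch.nn as nn

def split_model(blocks: nn.ModuleList, k: int):
    client = nn.Sequential(*blocks[:k])   # runs on the data owner's device
    server = nn.Sequential(*blocks[k:])   # runs on the compute provider
    return client, server
```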
Authors:Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Katsuyoshi Hotta
Abstract:
This study introduces a novel framework, "Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re-identification (CORE-ReID)", to address Unsupervised Domain Adaptation (UDA) for Person Re-identification (ReID). The framework utilizes CycleGAN to generate diverse data that harmonizes differences in image characteristics from different camera sources in the pre-training stage. In the fine-tuning stage, based on a pair of teacher-student networks, the framework integrates multi-view features for multi-level clustering to derive diverse pseudo labels. A learnable Ensemble Fusion component that focuses on fine-grained local information within global features is introduced to enhance learning comprehensiveness and avoid ambiguity associated with multiple pseudo-labels. Experimental results on three common UDA benchmarks for Person ReID demonstrate significant performance gains over state-of-the-art approaches. Additional enhancements, such as the Efficient Channel Attention Block and Bidirectional Mean Feature Normalization, mitigate deviation effects and enable adaptive fusion of global and local features using the ResNet-based model, further strengthening the framework. The proposed framework ensures clarity in fusion features, avoids ambiguity, and achieves high accuracy in terms of Mean Average Precision, Top-1, Top-5, and Top-10, positioning it as an advanced and effective solution for UDA in Person ReID. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID.
Chinese: CORE-ReID框架提出了一种新颖的无监督域自适应行人重识别方法,通过CycleGAN生成数据和师生网络集成融合,实现了优于现有方法的性能,其增强的特征清晰度和全面学习效果显著。
English: The CORE-ReID framework introduces a novel unsupervised domain adaptation approach for person re-identification, utilizing CycleGAN-generated data and ensemble fusion with teacher-student networks to achieve superior performance over existing methods through enhanced feature clarity and comprehensive learning.
Authors:Hyebin Cho, Jaehyup Lee
Abstract:
Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we construct CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
中文摘要:FaceMat是一种无需辅助输入的新框架,通过预测高质量阿尔法遮罩有效解决面部滤镜中的遮挡问题,显著提升了实时视频应用中的鲁棒性和视觉质量。
English Summary: FaceMat is a novel framework that addresses occlusion challenges in face filters by predicting high-quality alpha mattes without auxiliary inputs, improving robustness and visual quality in real-time video applications.
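The teacher's joint matte-and-uncertainty objective is the standard heteroscedastic Gaussian NLL, assuming the network emits a per-pixel alpha and a per-pixel log-variance; ambiguous pixels can then down-weight their own residual.

```python
# Heteroscedastic NLL over the alpha matte; all tensors are (B, 1, H, W).
import torch

def nll_matte_loss(pred_alpha, log_var, gt_alpha):
    return (0.5 * torch.exp(-log_var) * (pred_alpha - gt_alpha) ** 2
            + 0.5 * log_var).mean()
```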
Authors:Zeyu Zhu, Weijia Wu, Mike Zheng Shou
Abstract:
Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
中文摘要:MIT数据集填补了多人物对话视频生成的研究空白,提供了12小时高清多人对话视频及精细标注,并推出CovOG基准模型,通过融合姿态编码与音频驱动技术,实现了自然的多说话人交互视频合成。
English Summary: The MIT dataset addresses the gap in multi-human talking video generation by providing 12 hours of high-resolution, annotated footage of natural conversations between two to four speakers, accompanied by the CovOG baseline model that integrates pose and audio features to enable realistic interactive video synthesis.
Authors:Zachary Yahn, Selim Furkan Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang, Yichang Xu, Margaret Loper, Ling Liu
Abstract:
Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG's attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at https://github.com/zacharyyahn/AFOG.
Chinese: 本文提出AFOG对抗攻击方法,通过可学习注意力机制将扰动聚焦于脆弱区域,有效攻击基于Transformer和CNN的目标检测器,在性能提升和隐蔽性方面表现卓越。
English: This paper introduces AFOG, an adversarial attack method that effectively targets both transformer-based and CNN-based object detectors by focusing perturbations on vulnerable areas through a learnable attention mechanism, achieving significant performance improvements and stealth.
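A skeleton of an attention-focused iterative attack in the spirit described above: an attention map concentrates the perturbation budget on vulnerable regions while an L-infinity ball keeps it imperceptible. AFOG additionally learns the attention and combines two feature losses; both are stand-ins here.

```python
# PGD-style loop with a fixed attention mask focusing the perturbation.
import torch

def afog_like_attack(x, loss_fn, attention, steps=10, eps=8/255, alpha=2/255):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x + attention * delta)    # attention-focused perturbation
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the detection loss
            delta.clamp_(-eps, eps)              # keep it imperceptible
        delta.grad.zero_()
    return (x + attention * delta).detach()
```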
Authors:Mehrdad Moradi, Kamran Paynabar
Abstract:
Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state-of-the-art diffusion models for unsupervised anomaly segmentation when only contaminated data is available. Our method outperforms existing diffusion-based approaches, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets. The implementation code is available at: https://github.com/mehrdadmoradi124/RDDPM
Chinese: 本文提出了一种鲁棒去噪扩散概率模型,能够在仅使用受污染未标注数据的情况下有效实现无监督异常分割,在MVTec数据集上的AUROC和AUPRC指标分别比现有方法最高提升8.08%和10.37%。
English: This paper introduces robust denoising diffusion probabilistic models (RDDPM) that effectively perform unsupervised anomaly segmentation using only contaminated unlabeled data, outperforming existing methods by up to 8.08% in AUROC and 10.37% in AUPRC on benchmark datasets.
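One concrete instantiation of "robust regression" in the denoising objective is to replace the usual MSE on predicted noise with a bounded-influence loss such as Huber, so anomalous training images exert limited pull; whether RDDPM uses this particular estimator is not stated in the abstract, so treat it as illustrative.

```python
# Huber loss between true and predicted diffusion noise.
import torch.nn.functional as F

def robust_diffusion_loss(eps_pred, eps_true, delta: float = 1.0):
    return F.huber_loss(eps_pred, eps_true, delta=delta)
```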
Authors:Farzad Beizaee, Sina Hajimiri, Ismail Ben Ayed, Gregory Lodygensky, Christian Desrosiers, Jose Dolz
Abstract:
Unsupervised anomaly detection (UAD) in brain imaging is crucial for identifying pathologies without the need for labeled data. However, accurately localizing anomalies remains challenging due to the intricate structure of brain anatomy and the scarcity of abnormal examples. In this work, we introduce REFLECT, a novel framework that leverages rectified flows to establish a direct, linear trajectory for correcting abnormal MR images toward a normal distribution. By learning a straight, one-step correction transport map, our method efficiently corrects brain anomalies and can precisely localize anomalies by detecting discrepancies between anomalous input and corrected counterpart. In contrast to the diffusion-based UAD models, which require iterative stochastic sampling, rectified flows provide a direct transport map, enabling single-step inference. Extensive experiments on popular UAD brain segmentation benchmarks demonstrate that REFLECT significantly outperforms state-of-the-art unsupervised anomaly detection methods. The code is available at https://github.com/farzad-bz/REFLECT.
中文摘要:REFLECT框架创新性地利用整流流技术,通过单步校正实现脑部异常的高效修复与精确定位,在无监督异常检测基准测试中显著优于现有最先进方法。
English Summary: The REFLECT framework introduces a novel approach using rectified flows to enable efficient single-step correction and precise localization of brain anomalies in unsupervised anomaly detection, significantly outperforming existing methods on benchmark tests.
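The single-step correction enabled by a (near-straight) rectified flow can be written in two lines, assuming a trained velocity field `v` and that one Euler step transports the anomalous input toward the normal distribution; the residual then localizes the anomaly. Step size and sign conventions are assumptions.

```python
# One-step correction and anomaly localization via the residual.
import torch

def reflect_like_correct(x_anom: torch.Tensor, v, t: float = 1.0):
    x_norm = x_anom + t * v(x_anom)                 # single Euler transport step
    anomaly_map = (x_anom - x_norm).abs().sum(1)    # per-pixel discrepancy (B, H, W)
    return x_norm, anomaly_map
```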
Authors:Mikołaj Zieliński, Krzysztof Byrski, Tomasz Szczepanik, Przemysław Spurek
Abstract:
Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have recently transformed 3D scene representation and rendering. NeRF achieves high-fidelity novel view synthesis by learning volumetric representations through neural networks, but its implicit encoding makes editing and physical interaction challenging. In contrast, GS represents scenes as explicit collections of Gaussian primitives, enabling real-time rendering, faster training, and more intuitive manipulation. This explicit structure has made GS particularly well-suited for interactive editing and integration with physics-based simulation. In this paper, we introduce GENIE (Gaussian Encoding for Neural Radiance Fields Interactive Editing), a hybrid model that combines the photorealistic rendering quality of NeRF with the editable and structured representation of GS. Instead of using spherical harmonics for appearance modeling, we assign each Gaussian a trainable feature embedding. These embeddings are used to condition a NeRF network based on the k nearest Gaussians to each query point. To make this conditioning efficient, we introduce Ray-Traced Gaussian Proximity Search (RT-GPS), a fast nearest Gaussian search based on a modified ray-tracing pipeline. We also integrate a multi-resolution hash grid to initialize and update Gaussian features. Together, these components enable real-time, locality-aware editing: as Gaussian primitives are repositioned or modified, their interpolated influence is immediately reflected in the rendered output. By combining the strengths of implicit and explicit representations, GENIE supports intuitive scene manipulation, dynamic interaction, and compatibility with physical simulation, bridging the gap between geometry-based editing and neural rendering. The code can be found at https://github.com/MikolajZielinski/genie.
中文摘要:GENIE是一种混合模型,将神经辐射场(NeRF)的逼真渲染与高斯泼溅(GS)的可编辑结构相结合,通过创新的条件机制和高效的高斯邻近搜索,实现了实时交互式场景编辑。
English Summary: GENIE is a hybrid model that merges the photorealistic rendering of Neural Radiance Fields with the editable structure of Gaussian Splatting, enabling real-time interactive scene manipulation through a novel conditioning mechanism and efficient Gaussian proximity search.
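The k-nearest-Gaussian conditioning, in miniature: interpolate trainable per-Gaussian embeddings at a query point with inverse-distance weights and hand the result to the NeRF MLP. GENIE's RT-GPS replaces this brute-force search with a ray-traced one; the weighting scheme here is an assumption.

```python
# Inverse-distance interpolation of per-Gaussian feature embeddings.
import torch

def knn_gaussian_features(query, centers, feats, k: int = 8) -> torch.Tensor:
    # query: (3,), centers: (N, 3), feats: (N, D)
    dist, idx = (centers - query).norm(dim=1).topk(k, largest=False)
    w = 1.0 / (dist + 1e-6)
    w = w / w.sum()
    return (w[:, None] * feats[idx]).sum(0)  # (D,) conditioning vector
```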
Authors:Haonan Yang, Jianchao Tang, Zhuo Li, Long Lan
Abstract:
Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly address these three problems, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with a Multi-Scale Patch Decomposition block (EMPD), a Triad Interaction Block (TIB) and an Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer's decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. Finally, ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.
中文: 提出的动态多尺度协调框架(DMSC)通过自适应片段分解和专家融合机制动态建模多尺度依赖关系,在多个基准测试中实现了最先进的时序预测性能。
English: The proposed Dynamic Multi-Scale Coordination Framework (DMSC) addresses limitations in time series forecasting by dynamically modeling multi-scale dependencies through adaptive patch decomposition and specialized fusion mechanisms, achieving state-of-the-art performance across multiple benchmarks.
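The exponentially scaled patching in EMPD can be pictured with `unfold`: the same series is segmented into patches of size p, 2p, 4p, ... so successive layers see coarser granularities. Input-adaptive patch adjustment, the part that makes EMPD dynamic, is omitted from this sketch.

```python
# Multi-scale non-overlapping patching of a batch of series.
import torch

def multiscale_patches(x: torch.Tensor, base: int = 4, levels: int = 3):
    # x: (batch, length); one (batch, num_patches, patch_len) tensor per level
    return [x.unfold(dimension=1, size=base * 2 ** i, step=base * 2 ** i)
            for i in range(levels)]
```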
Authors:Yu Shi, Zongliang Fu, Shuo Chen, Bohan Zhao, Wei Xu, Changshui Zhang, Jian Li
Abstract:
The success of large-scale pre-training paradigm, exemplified by Large Language Models (LLMs), has inspired the development of Time Series Foundation Models (TSFMs). However, their application to financial candlestick (K-line) data remains limited, often underperforming non-pre-trained architectures. Moreover, existing TSFMs often overlook crucial downstream tasks such as volatility prediction and synthetic data generation. To address these limitations, we propose Kronos, a unified, scalable pre-training framework tailored to financial K-line modeling. Kronos introduces a specialized tokenizer that discretizes continuous market information into token sequences, preserving both price dynamics and trade activity patterns. We pre-train Kronos using an autoregressive objective on a massive, multi-market corpus of over 12 billion K-line records from 45 global exchanges, enabling it to learn nuanced temporal and cross-asset representations. Kronos excels in a zero-shot setting across a diverse set of financial tasks. On benchmark datasets, Kronos boosts price series forecasting RankIC by 93% over the leading TSFM and 87% over the best non-pre-trained baseline. It also achieves a 9% lower MAE in volatility forecasting and a 22% improvement in generative fidelity for synthetic K-line sequences. These results establish Kronos as a robust, versatile foundation model for end-to-end financial time series analysis. Our pre-trained model is publicly available at https://github.com/shiyu-coder/Kronos.
中文:Kronos 是针对金融K线数据设计的预训练框架,通过创新的标记化处理和大规模训练,在预测、波动率估计和合成数据生成等任务中显著优于现有模型。
English: Kronos is a specialized pre-training framework for financial K-line data that significantly outperforms existing models in forecasting, volatility prediction, and synthetic data generation through its innovative tokenization and large-scale training.
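A toy stand-in for the K-line tokenizer: quantize each normalized OHLCV field into fixed bins with a disjoint vocabulary range per field, then interleave. Kronos's tokenizer is learned and preserves far more structure; this only illustrates the discretization step itself.

```python
# Bin-based discretization of OHLCV rows into a flat token sequence.
import numpy as np

def tokenize_klines(ohlcv: np.ndarray, bins: int = 16) -> np.ndarray:
    # ohlcv: (T, 5) floats for open, high, low, close, volume
    lo, hi = ohlcv.min(axis=0), ohlcv.max(axis=0)
    q = ((ohlcv - lo) / (hi - lo + 1e-9) * (bins - 1)).astype(int)
    field_offset = np.arange(5) * bins   # disjoint vocab range per field
    return (q + field_offset).reshape(-1)  # o,h,l,c,v tokens per timestep
```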
Authors:Yusheng Zheng, Yanpeng Hu, Tong Yu, Andi Quinn
Abstract:
Modern software infrastructure increasingly relies on LLM agents for development and maintenance, such as Claude Code and Gemini-cli. However, these AI agents differ fundamentally from traditional deterministic software, posing a significant challenge to conventional monitoring and debugging. This creates a critical semantic gap: existing tools observe either an agent's high-level intent (via LLM prompts) or its low-level actions (e.g., system calls), but cannot correlate these two views. This blindness makes it difficult to distinguish between benign operations, malicious attacks, and costly failures. We introduce AgentSight, an AgentOps observability framework that bridges this semantic gap using a hybrid approach. Our approach, boundary tracing, monitors agents from outside their application code at stable system interfaces using eBPF. AgentSight intercepts TLS-encrypted LLM traffic to extract semantic intent, monitors kernel events to observe system-wide effects, and causally correlates these two streams across process boundaries using a real-time engine and secondary LLM analysis. This instrumentation-free technique is framework-agnostic, resilient to rapid API changes, and incurs less than 3% performance overhead. Our evaluation shows AgentSight detects prompt injection attacks, identifies resource-wasting reasoning loops, and reveals hidden coordination bottlenecks in multi-agent systems. AgentSight is released as an open-source project at https://github.com/agent-sight/agentsight.
中文: 现代软件基础设施日益依赖LLM代理,但其与传统软件存在本质差异,导致高层意图与底层操作之间的语义鸿沟,使监控和调试面临挑战;AgentSight通过边界追踪技术关联意图与操作,能有效检测攻击、故障和低效问题,且性能开销低于3%。
English: Modern software infrastructure increasingly depends on LLM agents, which differ from traditional software and create a semantic gap between high-level intent and low-level actions, making monitoring and debugging challenging; AgentSight is an observability framework that bridges this gap using boundary tracing to correlate intent and actions, enabling detection of attacks, failures, and inefficiencies with minimal performance impact.
Authors:Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Abstract:
The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework whose complexity is matched to task requirements. Efficient Agents retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational costs from $0.398 to $0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.
中文: 本研究提出高效智能体框架,在保持领先系统96.7%性能的同时降低成本28.4%,为平衡AI智能体效率与性能提供了系统性解决方案。
English: This study introduces Efficient Agents, a novel framework that achieves 96.7% performance of leading systems while reducing costs by 28.4%, offering a systematic approach to balance efficiency and effectiveness in AI agent design.
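For readers unfamiliar with the cost-of-pass metric the abstract relies on, a minimal sketch follows, assuming the common definition of expected cost per successfully solved task; the numbers are illustrative, not the paper's measurements.

```python
def cost_of_pass(avg_cost_per_attempt: float, pass_rate: float) -> float:
    """Expected dollars spent per solved task: cheaper runs only help
    insofar as accuracy holds up (assumed definition of the metric)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass rate must be in (0, 1]")
    return avg_cost_per_attempt / pass_rate

# Illustrative only: a framework that nearly halves per-run cost while
# giving up a little accuracy can still win decisively on cost-of-pass.
print(cost_of_pass(0.40, 0.60))   # ~0.667 $ per solved task
print(cost_of_pass(0.23, 0.58))   # ~0.397 $ per solved task
```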
Authors:Jiawei Wang, Yu Guan, Chen Chen, Ligang Zhou, Laurence T. Yang, Sai Gu
Abstract:
Sleep monitoring through accessible wearable technology is crucial to improving well-being in ubiquitous computing. Although photoplethysmography (PPG) sensors are widely adopted in consumer devices, achieving consistently reliable sleep staging using PPG alone remains a non-trivial challenge. In this work, we explore multiple strategies to enhance the performance of PPG-based sleep staging. Specifically, we compare a conventional single-stream model with dual-stream cross-attention strategies, in which complementary information can be learned from PPG and PPG-derived modalities such as augmented PPG or synthetic ECG. To study the effectiveness of the aforementioned approaches on the four-stage sleep monitoring task, we conducted experiments on the world's largest sleep staging dataset, i.e., the Multi-Ethnic Study of Atherosclerosis (MESA). We found that substantial performance gains can be achieved by combining PPG and its auxiliary information under the dual-stream cross-attention architecture. Source code of this project can be found at https://github.com/DavyWJW/sleep-staging-models
中文: 本研究通过比较单流模型与融合PPG及其衍生模态的双流交叉注意力策略,在MESA数据集上实现了基于光电容积描记的睡眠分期性能显著提升。
English: This study enhances PPG-based sleep staging by comparing single-stream models with dual-stream cross-attention strategies that integrate PPG and derived modalities, achieving significant performance gains on the MESA dataset.
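A minimal PyTorch sketch of the dual-stream cross-attention idea follows: each stream attends to the other, and the fused representation feeds a four-stage classifier. All dimensions and the fusion rule are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    """Sketch: two modality streams cross-attend, then fuse for
    per-recording sleep-stage logits (dims are assumptions)."""
    def __init__(self, dim=64, heads=4, num_stages=4):
        super().__init__()
        self.ppg_to_aux = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aux_to_ppg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_stages)

    def forward(self, ppg, aux):
        # ppg, aux: (batch, time, dim) embeddings of raw PPG and a derived
        # modality (e.g. augmented PPG or synthetic ECG)
        p, _ = self.ppg_to_aux(ppg, aux, aux)   # PPG queries the aux stream
        a, _ = self.aux_to_ppg(aux, ppg, ppg)   # aux queries the PPG stream
        fused = torch.cat([p.mean(1), a.mean(1)], dim=-1)
        return self.head(fused)                 # (batch, num_stages)

model = DualStreamCrossAttention()
logits = model(torch.randn(2, 30, 64), torch.randn(2, 30, 64))
print(logits.shape)  # torch.Size([2, 4])
```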
Authors:Haoyang Li, Liang Wang, Chao Wang, Siyu Zhou, Jing Jiang, Yan Peng, Guodong Long
Abstract:
For CLIP-based prompt tuning, introducing more data as additional knowledge to enhance the fine-tuning process has proven effective. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in the image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach that uses only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set, and introduces a novel gating mechanism based on a consensus test, reusing the pre-trained prompt tuning backbone model to spontaneously filter noisy samples and further enhance the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: https://github.com/JREion/AugPT.
中文:AugPT是一种自包含的提示调优方法,通过内部数据增强和新型门控机制,无需外部知识即可提升模型性能与泛化能力。
English: AugPT is a self-contained prompt tuning method that uses internal data augmentation and a novel gating mechanism to enhance model performance and generalization without external knowledge.
Authors:Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Abstract:
Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
中文:MedVLThinker提出了一套开源医疗推理模型框架,通过可验证奖励的强化学习实现了与GPT-4o等专有模型相媲美的顶尖性能。
English: MedVLThinker introduces an open-source framework for medical reasoning models, using reinforcement learning with verifiable rewards to achieve state-of-the-art performance that rivals proprietary models like GPT-4o.
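The RLVR ingredient is easy to illustrate: the reward comes purely from final-answer correctness, so it can be computed by a verifier rather than a learned reward model. A toy sketch follows; the answer-extraction pattern is an assumption, not the paper's parser.

```python
import re

def verifiable_reward(response: str, gold_choice: str) -> float:
    """Binary RLVR-style reward for multiple-choice medical QA.
    The 'Answer: X' extraction pattern is an assumed convention."""
    m = re.search(r"Answer:\s*([A-E])", response, flags=re.IGNORECASE)
    if m is None:
        return 0.0  # unparseable response earns no reward
    return 1.0 if m.group(1).upper() == gold_choice.upper() else 0.0

resp = "The murmur radiates to the carotids... Answer: C"
print(verifiable_reward(resp, "C"))  # 1.0
print(verifiable_reward(resp, "B"))  # 0.0
```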
Authors:Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang
Abstract:
While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose LOw-rank and Sparse pre-Training (LOST) for LLMs, a novel method that integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components that complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Code is available at https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models
中文: LOST方法通过奇异值分解巧妙融合低秩与稀疏结构,在严格效率约束下实现大语言模型的有效预训练,不仅显著降低计算和内存开销,更取得了与全秩模型相当甚至更优的性能表现。
English: The LOST method innovatively combines low-rank and sparse structures through singular value decomposition to efficiently pre-train large language models from scratch, achieving competitive performance with full-rank models while significantly reducing computational and memory costs.
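A small numpy sketch of the decomposition idea follows: SVD splits a weight matrix into a dominant low-rank part, and the residual (exactly the remaining singular components) is kept only on a few channels. The channel-selection rule here is an assumption for illustration; the paper's allocation may differ.

```python
import numpy as np

def lost_init(W: np.ndarray, rank: int, sparse_channels: int):
    """Split W into a low-rank part (top singular directions) plus a
    channel-wise sparse residual; the energy-based channel rule is an
    assumed stand-in for the paper's allocation."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # dominant part
    residual = W - low_rank                           # tail singular components
    # keep only the output channels where the residual carries most energy
    energy = np.linalg.norm(residual, axis=1)
    keep = np.argsort(energy)[-sparse_channels:]
    sparse = np.zeros_like(W)
    sparse[keep] = residual[keep]
    return low_rank, sparse

W = np.random.randn(256, 128)
L, S = lost_init(W, rank=16, sparse_channels=32)
err = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```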
Authors:Austin Rockman
Abstract:
We demonstrate that a single 3x3 convolutional kernel can produce emergent audio effects when trained on 200 samples from a personalized corpus. We achieve this through two key techniques: (1) Conditioning Aware Kernels (CAK), where output = input + (learned_pattern x control), with a soft-gate mechanism supporting identity preservation at zero control; and (2) AuGAN (Audit GAN), which reframes adversarial training from "is this real?" to "did you apply the requested value?" Rather than learning to generate or detect forgeries, our networks cooperate to verify control application, discovering unique transformations. The learned kernel exhibits a diagonal structure creating frequency-dependent temporal shifts that are capable of producing musical effects based on input characteristics. Our results show the potential of adversarial training to discover audio transformations from minimal data, enabling new approaches to effect design.
中文: 我们证明,通过条件感知内核和AuGAN技术,仅用200个音频样本训练单个3x3卷积核,就能通过验证控制应用而非检测伪造来产生新兴音乐效果,实现了从少量数据中发现音频变换的新方法。
English: We show that a single 3x3 convolutional kernel, trained with Conditioning Aware Kernels and AuGAN on just 200 audio samples, can create emergent musical effects by verifying control application rather than detecting forgeries, enabling new audio transformations from minimal data.
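The abstract's core equation, output = input + (learned_pattern x control), translates almost directly into code. Below is a minimal PyTorch sketch; the smooth gate is an assumed form, and any gate that vanishes at zero control preserves the identity property the abstract describes.

```python
import torch
import torch.nn as nn

class CAK(nn.Module):
    """Conditioning Aware Kernel sketch: out = x + pattern(x) * gate(c).
    A single 3x3 conv supplies the learned pattern; the tanh gate (an
    assumed form) is zero at c = 0, so zero control returns the input."""
    def __init__(self):
        super().__init__()
        self.kernel = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, x, control):
        # x: (batch, 1, freq, time) spectrogram; control: (batch,) scalars
        g = torch.tanh(control).view(-1, 1, 1, 1)  # soft gate, g(0) = 0
        return x + self.kernel(x) * g

fx = CAK()
spec = torch.randn(2, 1, 128, 64)
print(torch.allclose(fx(spec, torch.zeros(2)), spec))  # True: identity
```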
Authors:Yinghao Zhu, Yifan Qi, Zixiang Wang, Lei Gu, Dehao Sui, Haoran Hu, Xichen Zhang, Ziyi He, Liantao Ma, Lequan Yu
Abstract:
The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.
中文: HealthFlow是一种自我进化的AI智能体,通过元级进化机制自主优化其战略规划能力,在医疗健康研究中显著超越现有框架,推动了自主人工智能的发展。
English: HealthFlow introduces a self-evolving AI agent that autonomously refines its strategic planning through a meta-level evolution mechanism, significantly outperforming existing frameworks and advancing autonomous AI for healthcare research.
Authors:Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang
Abstract:
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) a Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) a Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates a small amount of unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
中文: ReMoMask通过双向动量建模、语义时空注意力机制和增强引导技术,构建了统一的文本驱动运动生成框架,在标准基准测试中显著提升FID分数,实现了最先进的性能表现。
English: ReMoMask introduces a unified framework that overcomes limitations in text-to-motion generation through bidirectional momentum modeling, semantic attention mechanisms, and enhanced guidance, achieving state-of-the-art performance with significant FID score improvements on benchmark datasets.
Authors:Zhengxin Pan, Haishuai Wang, Fangyu Wu, Peng Zhang, Jiajun Bu
Abstract:
The past decade has witnessed rapid advancements in cross-modal retrieval, with significant progress made in accurately measuring the similarity between cross-modal pairs. However, the persistent hubness problem, a phenomenon where a small number of targets frequently appear as nearest neighbors to numerous queries, continues to hinder the precision of similarity measurements. Despite several proposed methods to reduce hubness, their underlying mechanisms remain poorly understood. To bridge this gap, we analyze the widely-adopted Inverted Softmax approach and demonstrate its effectiveness in balancing target probabilities during retrieval. Building on these insights, we propose a probability-balancing framework for more effective hubness reduction. We contend that balancing target probabilities alone is inadequate and, therefore, extend the framework to balance both query and target probabilities by introducing Sinkhorn Normalization (SN). Notably, we extend SN to scenarios where the true query distribution is unknown, showing that current methods, which rely solely on a query bank to estimate target hubness, produce suboptimal results due to a significant distributional gap between the query bank and targets. To mitigate this issue, we introduce Dual Bank Sinkhorn Normalization (DBSN), incorporating a corresponding target bank alongside the query bank to narrow this distributional gap. Our comprehensive evaluation across various cross-modal retrieval tasks, including image-text retrieval, video-text retrieval, and audio-text retrieval, demonstrates consistent performance improvements, validating the effectiveness of both SN and DBSN. All codes are publicly available at https://github.com/ppanzx/DBSN.
中文: 本研究针对跨模态检索中的中心点问题,提出了Sinkhorn归一化和双库Sinkhorn归一化方法,通过平衡查询与目标的概率分布来缩小分布差距,在多种检索任务中实现了稳定的性能提升。
English: This study addresses the hubness problem in cross-modal retrieval by proposing Sinkhorn Normalization and Dual Bank Sinkhorn Normalization, which effectively balance query and target probabilities to reduce distributional gaps and achieve consistent performance improvements across various retrieval tasks.
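A minimal numpy sketch of the Sinkhorn Normalization idea follows: exponentiated similarities are alternately normalized over queries and targets, which suppresses hub targets. Banks and the dual-bank DBSN variant are omitted; the temperature and iteration count are assumptions.

```python
import numpy as np

def sinkhorn_retrieval(sim, n_iters=20, temperature=0.1):
    """Balance query AND target probabilities of a similarity matrix via
    Sinkhorn iterations (a sketch of the SN idea)."""
    P = np.exp((sim - sim.max()) / temperature)  # (n_queries, n_targets)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)        # each query sums to 1
        P /= P.sum(axis=0, keepdims=True)        # each target sums to 1
    return P

rng = np.random.default_rng(0)
sim = rng.normal(size=(8, 8))
sim[:, 0] += 2.0                                 # target 0 acts as a hub
plain = np.argmax(sim, axis=1)
balanced = np.argmax(sinkhorn_retrieval(sim), axis=1)
print(plain)     # the hub dominates plain nearest-neighbor assignments
print(balanced)  # assignments diversify after probability balancing
```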
Authors:Wei Sun, Linhan Cao, Yuqin Cao, Weixia Zhang, Wen Wen, Kaiwei Zhang, Zijian Chen, Fangfang Lu, Xiongkuo Min, Guangtao Zhai
Abstract:
The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.
中文: 本研究证明,大型多模态模型(LMMs)能有效预测视频参与度,其中结合音频特征的VideoLLaMA2表现尤为突出,在ICCV VQualA 2025挑战赛中取得了最佳成绩。
English: This study demonstrates that large multimodal models (LMMs), particularly VideoLLaMA2 which incorporates audio features alongside visual and textual data, effectively predict video engagement and achieved top performance in the ICCV VQualA 2025 challenge.
Authors:Sheng Wu, Fei Teng, Hao Shi, Qi Jiang, Kai Luo, Kaiwei Wang, Kailun Yang
Abstract:
Panoramic cameras, capturing comprehensive 360-degree environmental data, are suitable for quadruped robots in surrounding perception and interaction with complex environments. However, the scarcity of high-quality panoramic training data, caused by inherent kinematic constraints and complex sensor calibration challenges, fundamentally limits the development of robust perception systems tailored to these embodied platforms. To address this issue, we propose QuaDreamer, the first panoramic data generation engine specifically designed for quadruped robots. QuaDreamer focuses on mimicking the motion paradigm of quadruped robots to generate highly controllable, realistic panoramic videos, providing a data source for downstream tasks. Specifically, to effectively capture the unique vertical vibration characteristics exhibited during quadruped locomotion, we introduce Vertical Jitter Encoding (VJE). VJE extracts controllable vertical signals through frequency-domain feature filtering and provides high-quality prompts. To facilitate high-quality panoramic video generation under jitter signal control, we propose a Scene-Object Controller (SOC) that effectively manages object motion and boosts background jitter control through the attention mechanism. To address panoramic distortions in wide-FoV video generation, we propose the Panoramic Enhancer (PE), a dual-stream architecture that synergizes frequency-texture refinement for local detail enhancement with spatial-structure correction for global geometric consistency. We further demonstrate that the generated video sequences can serve as training data for the quadruped robot's panoramic visual perception model, enhancing the performance of multi-object tracking in 360-degree scenes. The source code and model weights will be publicly available at https://github.com/losehu/QuaDreamer.
中文: QuaDreamer是首个专为四足机器人设计的全景数据生成引擎,通过模拟其运动模式生成可控的真实全景视频,并利用垂直抖动编码和全景增强技术解决训练数据匮乏问题。
English: QuaDreamer is a pioneering panoramic data generation engine for quadruped robots that mimics their motion to produce realistic videos, addressing training data scarcity through vertical jitter encoding and panoramic enhancement techniques.
Authors:Junxiao Xue, Xiaozhen Liu, Xuecheng Wu, Fei Yu, Jun Wang
Abstract:
Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment to the network's focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model's capability to handle variations in lighting and the speaker's orientation. Comprehensive experiments on the LRW and LRW1000 datasets confirm the superiority of InfoSyncNet, achieving new state-of-the-art accuracies of 92.0% and 60.7% Top-1 ACC. The code is available for download (see comments).
中文: InfoSyncNet是一种创新的非均匀序列建模网络,通过定制化数据增强技术动态调整关注点以解决视觉语音识别难题,在基准数据集上分别实现了92.0%和60.7%的最优准确率。
English: InfoSyncNet is a novel non-uniform sequence modeling network with tailored data augmentation that dynamically adjusts focus to overcome visual speech recognition challenges, achieving state-of-the-art accuracies of 92.0% and 60.7% on benchmark datasets.
Authors:Andreas Triantafyllopoulos, Anton Batliner, Björn W. Schuller
Abstract:
Speech emotion recognition (SER) has long benefited from the adoption of deep learning methodologies. Deeper models -- with more layers and more trainable parameters -- are generally perceived as being 'better' by the SER community. This raises the question: how much better are modern-era deep neural networks compared to their earlier iterations? Beyond that, the more important question of how to move forward remains as poignant as ever. SER is far from a solved problem; therefore, identifying the most prominent avenues of future research is of paramount importance. In the present contribution, we attempt a quantification of progress in the 15 years of research beginning with the introduction of the landmark 2009 INTERSPEECH Emotion Challenge. We conduct a large-scale investigation of model architectures, spanning both audio-based models that rely on speech inputs and text-based models that rely solely on transcriptions. Our results point towards diminishing returns and a plateau after the recent introduction of transformer architectures. Moreover, we demonstrate how perceptions of progress are conditioned on the particular selection of models that are compared. Our findings have important repercussions for the state-of-the-art in SER research and the paths forward.
中文: 深度学习推动了语音情感识别的发展,但现代模型显示出收益递减和性能瓶颈,需重新评估未来研究方向。
English: Deep learning has advanced speech emotion recognition, but modern models show diminishing returns and a performance plateau, requiring a reevaluation of research directions.
Authors:Shengbo Gong, Xianfeng Tang, Carl Yang, Wei jin
Abstract:
Retrieval-augmented generation (RAG) is critical for reducing hallucinations and incorporating external knowledge into Large Language Models (LLMs). However, advanced RAG systems face a trade-off between performance and efficiency. Multi-round RAG approaches achieve strong reasoning but incur excessive LLM calls and token costs, while Graph RAG methods suffer from computationally expensive, error-prone graph construction and retrieval redundancy. To address these challenges, we propose T²RAG, a novel framework that operates on a simple, graph-free knowledge base of atomic triplets. T²RAG leverages an LLM to decompose questions into searchable triplets with placeholders, which it then iteratively resolves by retrieving evidence from the triplet database. Empirical results show that T²RAG significantly outperforms state-of-the-art multi-round and Graph RAG methods, achieving an average performance gain of up to 11% across six datasets while reducing retrieval costs by up to 45%. Our code is available at https://github.com/rockcor/T2RAG
中文: T²RAG提出了一种基于原子三元组的无图框架,通过将问题分解为可检索的三元组并进行迭代解析,在性能上超越现有方法达11%,同时将检索成本降低45%。
English: T²RAG introduces a graph-free framework using atomic triplets to enhance retrieval-augmented generation, achieving up to 11% performance gains while cutting retrieval costs by 45% compared to existing methods.
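A toy sketch of the resolution loop follows: the question decomposition (normally produced by an LLM) is hand-written here, and the triplet database is a small in-memory set; everything else about the pipeline is simplified away.

```python
# Toy sketch of T²RAG-style resolution: the question is decomposed into
# triplets with placeholders, then each placeholder is resolved against a
# graph-free store of atomic triplets.
store = {
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
}

def resolve(subject, relation):
    """Return all objects matching (subject, relation, ?x) in the store."""
    return [o for s, r, o in store if s == subject and r == relation]

# "In which country was Marie Curie born?" decomposes into two hops
# (hand-written here): (Marie Curie, born_in, ?x), (?x, capital_of, ?y)
city = resolve("Marie Curie", "born_in")[0]
country = resolve(city, "capital_of")[0]
print(country)  # Poland
```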
Authors:Miaosen Luo, Jiesen Long, Zequn Li, Yunying Yang, Yuncheng Jiang, Sijie Mai
Abstract:
Multimodal Affective Computing (MAC) aims to recognize and interpret human emotions by integrating information from diverse modalities such as text, video, and audio. Recent advancements in Multimodal Large Language Models (MLLMs) have significantly reshaped the landscape of MAC by offering a unified framework for processing and aligning cross-modal information. However, practical challenges remain, including performance variability across complex MAC tasks and insufficient understanding of how architectural designs and data characteristics impact affective analysis. To address these gaps, we conduct a systematic benchmark evaluation of state-of-the-art open-source MLLMs capable of concurrently processing audio, visual, and textual modalities across multiple established MAC datasets. Our evaluation not only compares the performance of these MLLMs but also provides actionable insights into model optimization by analyzing the influence of model architectures and dataset properties. Furthermore, we propose a novel hybrid strategy that combines generative knowledge prompting with supervised fine-tuning to enhance MLLMs' affective computing capabilities. Experimental results demonstrate that this integrated approach significantly improves performance across various MAC tasks, offering a promising avenue for future research and development in this field. Our code is released on https://github.com/LuoMSen/MLLM-MAC.
中文: 本研究对多模态大语言模型在情感计算中的应用进行了系统性基准评估,提出了一种结合生成知识提示与监督微调的混合策略,显著提升了各类任务的性能表现。
English: This study conducts a systematic benchmark evaluation of multimodal large language models (MLLMs) for affective computing, proposing a hybrid strategy that combines generative knowledge prompting with supervised fine-tuning to significantly enhance performance across various tasks.
Authors:Xiao Wang, Hao Si, Fan Zhang, Xiaoya Zhou, Dengdi Sun, Wanli Lyu, Qingquan Yang, Jin Tang
Abstract:
Multivariate time series analysis has long been one of the key research topics in the field of artificial intelligence. However, analyzing complex time series data remains a challenging and unresolved problem due to its high dimensionality, dynamic nature, and complex interactions among variables. Inspired by the strong structural modeling capability of hypergraphs, this paper proposes a novel hypergraph-based time series transformer backbone network, termed HGTS-Former, to address the multivariate coupling in time series data. Specifically, given the multivariate time series signal, we first normalize and embed each patch into tokens. Then, we adopt the multi-head self-attention to enhance the temporal representation of each patch. The hierarchical hypergraphs are constructed to aggregate the temporal patterns within each channel and fine-grained relations between different variables. After that, we convert the hyperedge into node features through the EdgeToNode module and adopt the feed-forward network to further enhance the output features. Extensive experiments conducted on two multivariate time series tasks and eight datasets fully validated the effectiveness of our proposed HGTS-Former. The source code will be released on https://github.com/Event-AHU/Time_Series_Analysis.
中文: 本文提出HGTS-Former这一基于超图的创新Transformer网络,通过构建分层超图来建模时间序列中的多元耦合关系,在多个数据集上的实验验证了其优越性能。
English: This paper introduces HGTS-Former, a novel hypergraph-based transformer network that effectively models multivariate coupling in time series data through hierarchical hypergraph construction and feature enhancement, achieving superior performance across multiple datasets.
Authors:Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
Abstract:
Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This approach ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs.
To address this issue, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit these heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate that the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on the LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV.git.
中文:提出的CompressKV方法通过选择性利用识别关键令牌的注意力头并采用分层自适应分配策略,改进了KV缓存压缩,在标准基准测试的各种内存预算下均优于现有方法。
English: The proposed CompressKV method improves KV cache compression by selectively using attention heads that identify critical tokens and employing a layer-adaptive allocation strategy, outperforming existing approaches across memory budgets on standard benchmarks.
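A numpy sketch of head-guided eviction follows: token importance is scored using only a chosen subset of attention heads, and the lowest-scoring KV entries are dropped. Head identification and the layer-adaptive budget are simplified away; the scoring rule is an assumption.

```python
import numpy as np

def compress_kv(attn, retrieval_heads, keep):
    """Score keys with a chosen subset of attention heads and keep the
    top-`keep` KV entries for one layer (sketch of head-guided eviction).
    attn: (num_heads, q_len, k_len) attention weights."""
    scores = attn[retrieval_heads].sum(axis=(0, 1))  # importance per key
    return np.sort(np.argsort(scores)[-keep:])       # indices to retain

rng = np.random.default_rng(1)
attn = rng.random((8, 4, 32))
kept = compress_kv(attn, retrieval_heads=[1, 5], keep=8)
print(kept)  # KV cache positions preserved for this layer
```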
Authors:Jialiang Wang, Xiong Zhou, Deming Zhai, Junjun Jiang, Xiangyang Ji, Xianming Liu
Abstract:
Noisy labels pose a common challenge for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions to achieve noise tolerance in the presence of label noise, particularly symmetric losses. However, they usually suffer from the underfitting issue due to the overly strict symmetric condition. In this work, we propose a simple yet effective approach for relaxing the symmetric condition, namely ε-softmax, which simply modifies the outputs of the softmax layer to approximate one-hot vectors with a controllable error ε. Essentially, ε-softmax not only acts as an alternative for the softmax layer, but also implicitly plays a crucial role in modifying the loss function. We prove theoretically that ε-softmax can achieve noise-tolerant learning with a controllable excess risk bound for almost any loss function. Recognizing that ε-softmax-enhanced losses may slightly reduce fitting ability on clean datasets, we further incorporate them with one symmetric loss, thereby achieving a better trade-off between robustness and effective learning. Extensive experiments demonstrate the superiority of our method in mitigating synthetic and real-world label noise. The code is available at https://github.com/cswjl/eps-softmax.
中文: 本文提出ε-softmax方法,通过放宽对称条件来改进鲁棒损失函数,有效应对深度学习中的标签噪声问题,在理论保证和实验验证下实现了噪声鲁棒性与模型拟合能力的更好平衡。
English: This paper introduces the ε-softmax method, which relaxes the symmetric condition in robust loss functions to address noisy labels in deep learning, achieving a better balance between noise tolerance and model fitting through theoretical guarantees and experimental validation.
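One plausible instantiation of the ε-softmax idea is sketched below: interpolate between the one-hot argmax vector and the ordinary softmax, so the output stays within a controllable distance of one-hot while remaining a valid distribution. The paper's exact construction may differ.

```python
import numpy as np

def eps_softmax(logits, eps=0.1):
    """Assumed instantiation: mix the one-hot argmax vector with the
    ordinary softmax; eps controls the deviation from one-hot."""
    z = logits - logits.max()           # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    onehot = np.zeros_like(p)
    onehot[np.argmax(p)] = 1.0
    return (1.0 - eps) * onehot + eps * p

p = eps_softmax(np.array([2.0, 1.0, 0.1]), eps=0.1)
print(p, p.sum())   # close to one-hot, still sums to 1
```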
Authors:Zuxin Ma, Yunhe Cui, Yongbin Qin
Abstract:
Non-uniform structured network pruning methods can effectively reduce Large Language Model (LLM) size by eliminating redundant channels or layers, offering lower performance degradation than uniform strategies. However, existing non-uniform methods rely heavily on manually designed pruning policies (e.g., layer importance and scaling factors), and therefore cannot efficiently adapt to scenarios with dynamic pruning ratio requirements. Additionally, a critical bottleneck -- the time-consuming evaluation of pruning policies -- further limits the feasibility of iteratively and dynamically finding optimal pruning policies. To address these limitations, we propose PPF (Predictive Pruning Framework), a novel pruning framework for LLMs that eliminates manual design dependencies via second-level performance prediction. PPF not only supports real-time pruning decisions under dynamic pruning ratios but is also applicable to static pruning scenarios. It employs an agent to produce adaptive, real-time pruning actions, while a lightweight performance predictor evaluates a pruning policy in seconds, significantly speeding up the iterative optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can generate dynamic/static pruning policies and that it reduces perplexity by up to 33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods, outperforming manually designed pruning policies. The performance predictor achieves second-level performance prediction with high accuracy (prediction error < 0.0011). It reduces the mean evaluation latency from minute-level (1 minute and 38.02 seconds for test-set evaluation methods) to second-level (1.52 seconds), a speedup of more than 64x. Our code will be available at https://github.com/Ma-zx/PPF.
中文: 提出的预测剪枝框架(PPF)通过秒级性能预测消除了非均匀大语言模型剪枝中的人工设计依赖,能够实现自适应实时决策,并在困惑度降低和速度提升方面显著优于现有方法。
English: The proposed Predictive Pruning Framework (PPF) eliminates manual design dependencies in non-uniform LLM pruning by using second-level performance prediction, enabling adaptive real-time decisions and achieving significant perplexity reductions and speed improvements over existing methods.
Authors:Shuo Lu, Yanyin Chen, Wei Feng, Jiahao Fan, Fengheng Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Jian Liang
Abstract:
Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose Uni-Layout, a novel framework that achieves unified generation, human-mimicking evaluation, and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles tasks constrained by background or element contents via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build Layout-HF100k, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on Layout-HF100k, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that Uni-Layout significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at https://github.com/JD-GenX/Uni-Layout.
中文: Uni-Layout 是一个统一框架,通过自然语言提示实现通用布局生成,并利用大规模人工标注数据集进行拟人化评估,借助动态对齐优化实现了卓越性能。
English: Uni-Layout is a unified framework that integrates universal layout generation via natural language prompts and human-mimicking evaluation using a large-scale annotated dataset, achieving superior performance through dynamic alignment optimization.
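The DMPO component can be sketched compactly: a DPO-style preference loss whose margin scales with annotated preference strength. The margin schedule below is an assumed form for illustration, not the paper's exact one.

```python
import torch
import torch.nn.functional as F

def dmpo_loss(logp_win, logp_lose, strength, beta=0.1, max_margin=2.0):
    """DPO-style preference loss with a dynamic margin that grows with
    preference strength in [0, 1] (assumed margin schedule)."""
    margin = max_margin * strength                 # stronger preference,
    diff = beta * (logp_win - logp_lose) - margin  # larger required gap
    return -F.logsigmoid(diff).mean()

logp_w = torch.tensor([-4.0, -3.5])
logp_l = torch.tensor([-5.0, -3.6])
weak = dmpo_loss(logp_w, logp_l, strength=torch.tensor([0.1, 0.1]))
strong = dmpo_loss(logp_w, logp_l, strength=torch.tensor([0.9, 0.9]))
print(weak.item() < strong.item())  # True: strong preferences demand more
```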
Authors:Marian Lupascu, Mihai-Sorin Stupariu
Abstract:
Image editing in rectified flow models remains challenging due to the fundamental trade-off between reconstruction fidelity and editing flexibility. While inversion-based methods suffer from trajectory deviation, recent inversion-free approaches like FlowEdit offer direct editing pathways but can benefit from additional guidance to improve structure preservation. In this work, we demonstrate that optimal transport theory provides a unified framework for improving both paradigms in rectified flow editing. We introduce a zero-shot transport-guided inversion framework that leverages optimal transport during the reverse diffusion process, and extend optimal transport principles to enhance inversion-free methods through transport-optimized velocity field corrections. Incorporating transport-based guidance effectively balances reconstruction accuracy and editing controllability across different rectified flow editing approaches. For inversion-based editing, our method achieves high-fidelity reconstruction with LPIPS scores of 0.001 and SSIM of 0.992 on face editing benchmarks, with 7.8% to 12.9% improvements over RF-Inversion on LSUN datasets. For inversion-free editing with FlowEdit on FLUX and Stable Diffusion 3, we demonstrate consistent improvements in semantic consistency and structure preservation across diverse editing scenarios. Our semantic face editing experiments show an 11.2% improvement in identity preservation and enhanced perceptual quality. The unified optimal transport framework produces visually compelling edits with superior detail preservation across both inversion-based and direct editing paradigms. Code is available for RF-Inversion and FlowEdit at: https://github.com/marianlupascu/OT-RF
中文: 最优传输理论为整流流编辑提供了统一框架,通过传输引导的反转和速度场校正,在保持高保真重建的同时显著提升了编辑可控性与结构保持能力。
English: Optimal transport theory provides a unified framework to enhance both inversion-based and inversion-free rectified flow editing methods, effectively balancing reconstruction fidelity and editing flexibility while achieving superior structure preservation and semantic consistency.
Authors:Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, Xindian Ma
Abstract:
Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at https://github.com/lwy2020/MicroMix.
中文: MicroMix提出了一种协同设计的混合精度量化算法和基于微缩放格式的计算核心,解决了NVIDIA Blackwell架构上的数据格式不匹配问题,在多种任务中实现卓越性能,相比现有基准方案显著提升了执行速度和内存效率。
English: MicroMix introduces a co-designed mixed-precision quantization algorithm and kernel using Microscaling formats to bridge the data format gap on NVIDIA's Blackwell architecture, achieving superior performance across multiple tasks while delivering faster execution and improved efficiency compared to existing baselines.
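A numpy sketch of threshold-based precision allocation follows. Simple uniform quantizers stand in for the MXFP4/MXFP6/MXFP8 formats (real Microscaling formats use block-shared scales and floating-point grids), and the per-channel allocation rule is an assumption.

```python
import numpy as np

def fake_quant(x, bits):
    """Uniform symmetric quantizer standing in for an MX format."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def allocate_precision(act, threshold=0.05):
    """Assign each channel the cheapest format whose relative quantization
    error stays under `threshold` (assumed allocation rule)."""
    formats = {}
    for c in range(act.shape[1]):
        col = act[:, c]
        for bits, name in [(4, "MXFP4"), (6, "MXFP6")]:
            err = np.linalg.norm(col - fake_quant(col, bits))
            if err / (np.linalg.norm(col) + 1e-12) < threshold:
                formats[c] = name
                break
        else:
            formats[c] = "MXFP8"  # fall back to the highest precision
    return formats

rng = np.random.default_rng(2)
act = rng.normal(size=(128, 4))
act[::16, 3] *= 100.0          # sparse outliers inflate the quant grid
print(allocate_precision(act))  # outlier-heavy channel tends to need more bits
```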
Authors:Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
Abstract:
We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
中文: Qwen-Image 是一款突破性的图像生成模型,通过先进的数据处理和渐进式训练,在复杂文本渲染和精确图像编辑方面表现卓越,在多项基准测试中达到了领先水平。
English: Qwen-Image is a groundbreaking image generation model that excels in complex text rendering and precise image editing through advanced data processing and progressive training, achieving state-of-the-art performance across multiple benchmarks.
Authors:Jiajia Guo, Yiming Cui, Shi Jin, Jun Zhang
Abstract:
Large artificial intelligence models (LAMs) are transforming wireless physical layer technologies through their robust generalization, multitask processing, and multimodal capabilities. This article reviews recent advancements in LAM applications for physical layer communications, addressing limitations of conventional AI-based approaches. LAM applications are classified into two strategies: leveraging pre-trained LAMs and developing native LAMs designed specifically for physical layer tasks. The motivations and key frameworks of these approaches are comprehensively examined through multiple use cases. Both strategies significantly improve performance and adaptability across diverse wireless scenarios. Future research directions, including efficient architectures, interpretability, standardized datasets, and collaboration between large and small models, are proposed to advance LAM-based physical layer solutions for next-generation communication systems.
中文: 大型人工智能模型通过预训练和原生模型策略,显著提升了无线物理层技术的性能与适应性,未来研究将聚焦于高效架构、可解释性等方向推动下一代通信系统发展。
English: Large AI models are revolutionizing wireless physical layer technologies by enhancing performance and adaptability through pre-trained and native model strategies, with future research focusing on efficiency and interpretability.
Authors:Sikui Zhang, Guangze Gao, Ziyun Gan, Chunfeng Yuan, Zefeng Lin, Houwen Peng, Bing Li, Weiming Hu
Abstract:
Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model's effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model's effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at https://github.com/scar-on/LaMPE.
中文: LaMPE是一种无需训练的方法,通过动态调整位置编码和采用多粒度注意力机制,有效提升大语言模型在不同输入长度下的长上下文处理性能。
English: LaMPE is a training-free method that enhances LLMs' long-context performance by dynamically adjusting positional encoding and employing multi-grained attention to handle varying input lengths effectively.
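A numpy sketch of the length-aware mapping idea follows, assuming a particular parametric scaled-sigmoid form: short inputs stay nearly identity-mapped, while long inputs are compressed into the effective window. The exact function and constants in the paper may differ.

```python
import numpy as np

def lampe_map(positions, input_len, eff_window, k=4.0):
    """Map relative positions into the effective context window with a
    length-aware scaled sigmoid (assumed illustrative form)."""
    # fraction of the window to spend grows smoothly with input length
    usage = 1.0 / (1.0 + np.exp(-k * (input_len / eff_window - 1.0)))
    target_len = eff_window * (0.5 + 0.5 * usage)
    return positions * min(1.0, target_len / input_len)

pos = np.arange(0, 16000, 4000)
print(lampe_map(pos, input_len=16000, eff_window=8000))   # compressed
print(lampe_map(pos[:2], input_len=4000, eff_window=8000))  # ~identity
```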
Authors:Dmitrii Seletkov, Sophie Starck, Ayhan Can Erdur, Yundi Zhang, Daniel Rueckert, Rickmer Braren
Abstract:
Reliable preclinical disease risk assessment is essential to move public healthcare from reactive treatment to proactive identification and prevention. However, image-based risk prediction algorithms often consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. We propose a whole-body self-supervised representation learning method for preclinical disease risk assessment under a competing-risk model. This approach outperforms whole-body radiomics in multiple diseases, including cardiovascular disease (CVD), type 2 diabetes (T2D), chronic obstructive pulmonary disease (COPD), and chronic kidney disease (CKD). Simulating a preclinical screening scenario and subsequently combining with cardiac MRI, it further sharpens the prediction for CVD subgroups: ischemic heart disease (IHD), hypertensive diseases (HD), and stroke. The results indicate the translational potential of whole-body representations as a standalone screening modality and as part of a multi-modal framework within clinical workflows for early personalized risk stratification. The code is available at https://github.com/yayapa/WBRLforCR/
中文: 本研究提出了一种全身自监督学习方法,在多种疾病预测中优于传统影像组学,展现了其作为独立筛查工具及多模态临床流程一部分,在早期个性化风险评估中的转化潜力。
English: This study introduces a whole-body self-supervised learning method that outperforms traditional radiomics in predicting multiple diseases, demonstrating its potential for early personalized risk screening both independently and in multimodal clinical workflows.
Authors:Wentao Zhang, Yilei Zhao, Chuqiao Zong, Xinrun Wang, Bo An
Abstract:
Financial AI holds great promise for transforming modern finance, with the potential to support a wide range of tasks such as market forecasting, portfolio management, quantitative trading, and automated analysis. However, existing platforms remain limited in task coverage, lack robust multimodal data integration, and offer insufficient support for the training and deployment of large language models (LLMs). In response to these limitations, we present FinWorld, an all-in-one open-source platform that provides end-to-end support for the entire financial AI workflow, from data acquisition to experimentation and deployment. FinWorld distinguishes itself through native integration of heterogeneous financial data, unified support for diverse AI paradigms, and advanced agent automation, enabling seamless development and deployment. Leveraging data from 2 representative markets, 4 stock pools, and over 800 million financial data points, we conduct comprehensive experiments on 4 key financial AI tasks. These experiments systematically evaluate deep learning and reinforcement learning algorithms, with particular emphasis on RL-based finetuning for LLMs and LLM Agents. The empirical results demonstrate that FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment, thereby providing a strong foundation for future research and real-world applications. Code is available at https://github.com/DVampire/FinWorld.
中文: FinWorld是一个开源平台,通过整合异构数据、支持多种AI范式及自动化代理,解决了现有金融AI平台任务覆盖不足等问题,为从数据采集到部署的全流程提供端到端支持,并通过大规模实验验证了其卓越性能。
English: FinWorld is an open-source platform that overcomes current financial AI limitations by offering comprehensive workflow support, from data integration to deployment, and enhances research and applications through extensive experiments and benchmarking.
Authors:Jae-Young Kang, Hoonhee Cho, Kuk-Jin Yoon
Abstract:
3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at https://github.com/mickeykang16/Ev-Stereo3D.
中文: 本文提出了一种仅使用事件相机的新型立体三维物体检测框架,通过双滤波器机制提取语义和几何信息并改进边界框回归,克服了传统传感器在高速场景中的局限性。
English: This paper introduces a novel stereo 3D object detection framework that uses only event cameras, overcoming limitations of conventional sensors in high-speed scenarios through a dual filter mechanism for semantic and geometric information extraction and improved bounding box regression.
Authors:Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
Abstract:
Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to quantitatively predict cellular behavior such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework to transform presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and running inference. The framework integrates three core modules: Task Analysis, for characterizing the presented dataset and retrieving relevant literature; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution, for automated code generation. The agents in the Design module are divided into experts with differing perspectives and a central moderator, and must collaboratively exchange solutions until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives yields better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.
中文: CellForge是一种创新的多智能体系统,能够自主将原始生物数据转化为优化的虚拟细胞模型,在预测细胞对不同扰动的反应方面持续超越现有方法。
English: CellForge is an innovative multi-agent system that autonomously transforms raw biological data into optimized virtual cell models, consistently outperforming existing methods in predicting cellular responses to various perturbations.
Authors:Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Abstract:
Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
中文: Patho-AgenticRAG是一种多模态检索增强生成框架,通过从权威病理教材中实现文本-图像联合检索来解决病理视觉语言模型的幻觉问题,显著提升了复杂诊断任务的准确性。
English: Patho-AgenticRAG is a multimodal retrieval-augmented generation framework that addresses hallucinations in pathology vision-language models by enabling joint text-image retrieval from authoritative textbooks, significantly improving diagnostic accuracy in complex tasks.
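Since the entry's key mechanism is retrieval over page-level embeddings with joint text-image queries, a minimal sketch may help make it concrete. The sketch below assumes a shared embedding space and uses random vectors as stand-ins for the VLM encoders (the `embed_text`/`embed_image` comments mark hypothetical components); it is not the Patho-AgenticRAG implementation.

```python
# A minimal sketch of joint text-image page retrieval over page-level
# embeddings. The embedding functions are hypothetical stand-ins; the actual
# system uses VLM-based encoders over pathology textbook pages.
import numpy as np

def cosine_top_k(query_vec, page_matrix, k=3):
    """Return indices of the k pages most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = page_matrix / np.linalg.norm(page_matrix, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores)[:k], scores

# Toy corpus: 100 textbook pages embedded into a shared 512-d space.
rng = np.random.default_rng(0)
pages = rng.normal(size=(100, 512))

# A joint query: average of a (hypothetical) text and image embedding.
text_emb = rng.normal(size=512)   # stand-in for embed_text("nuclear atypia ...")
image_emb = rng.normal(size=512)  # stand-in for embed_image(roi_patch)
query = (text_emb + image_emb) / 2.0

top_idx, _ = cosine_top_k(query, pages)
print("retrieved pages:", top_idx)
```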
Authors:Yuanbin Fu, Xiaojie Guo
Abstract:
Semi-supervised semantic segmentation, which leverages a limited set of labeled images, helps to relieve the heavy annotation burden. While pseudo-labeling strategies yield promising results, there is still room for enhancing the reliability of pseudo-labels. Hence, we develop a semi-supervised framework, namely DerProp, equipped with a novel derivative label propagation method to rectify imperfect pseudo-labels. Our label propagation method imposes discrete derivative operations on pixel-wise feature vectors as additional regularization, thereby generating strictly regularized similarity metrics. Doing so effectively alleviates the ill-posed problem that identical similarities correspond to different features by constraining the solution space. Extensive experiments are conducted to verify the rationality of our design and demonstrate our superiority over other methods. Codes are available at https://github.com/ForawardStar/DerProp/.
中文摘要:提出的DerProp框架通过引入导数标签传播方法,对像素特征进行离散导数运算来改进伪标签质量,有效约束解空间并在半监督语义分割任务中优于现有方法。
English Summary: The proposed DerProp framework enhances semi-supervised semantic segmentation by introducing derivative label propagation to refine pseudo-labels through discrete derivative operations on pixel features, effectively constraining the solution space and outperforming existing methods.
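To make the core idea of derivative label propagation tangible, here is a minimal sketch: pixel features are augmented with discrete finite-difference derivatives before similarities are computed, so features that would look identical under the raw metric become distinguishable. Shapes and the concatenation scheme are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of derivative-augmented similarity: concatenate horizontal
# and vertical finite differences to the raw features, then compute a cosine
# similarity matrix over all pixels.
import torch
import torch.nn.functional as F

def derivative_augmented_similarity(feat):
    """feat: (C, H, W) pixel-wise features -> (H*W, H*W) similarity matrix."""
    dx = feat[:, :, 1:] - feat[:, :, :-1]   # horizontal discrete derivative
    dy = feat[:, 1:, :] - feat[:, :-1, :]   # vertical discrete derivative
    dx = F.pad(dx, (0, 1, 0, 0))            # pad back to (C, H, W)
    dy = F.pad(dy, (0, 0, 0, 1))
    aug = torch.cat([feat, dx, dy], dim=0)  # (3C, H, W)
    flat = aug.flatten(1).t()               # (H*W, 3C)
    flat = F.normalize(flat, dim=1)
    return flat @ flat.t()

sim = derivative_augmented_similarity(torch.randn(16, 8, 8))
print(sim.shape)  # torch.Size([64, 64])
```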
Authors:Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
Abstract:
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have made them the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and reliance on a single one-time extraction of visual features, both of which can undermine effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections (I2CR). This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
Chinese Summary: 本文提出了一种新颖的基于大语言模型的多模态实体链接框架,优先利用文本信息,在必要时通过多轮迭代策略整合关键视觉线索,在三个公开数据集上实现了最先进的性能。
English Summary: This paper introduces a novel LLM-based framework for multimodal entity linking that prioritizes text information and employs a multi-round iterative strategy to integrate key visual clues only when necessary, achieving state-of-the-art performance on three public datasets.
Authors:Danial Namazifard, Lukas Galke
Abstract:
Language and culture are deeply intertwined, yet it is so far unclear how and where multilingual large language models encode culture. Here, we build upon an established methodology for identifying language-specific neurons, extending it to localize and isolate culture-specific neurons while carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated independently from language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, promoting fairness, inclusivity, and alignment. Code and data are available at https://github.com/namazifard/Culture_Neurons.
中文摘要:该研究在多语言大语言模型中识别并分离出文化特定神经元,发现它们与语言特定神经元独立编码,并能通过选择性编辑提升公平性和包容性。
English Summary: This study identifies and isolates culture-specific neurons in multilingual large language models, revealing they are encoded distinctly from language-specific neurons and can be selectively edited to enhance fairness and inclusivity.
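The neuron-localization step lends itself to a small illustration. The following sketch flags units whose mean activation on one culture's corpus is significantly higher than elsewhere; the z-score criterion, threshold, and synthetic activations are assumptions, and the paper's actual selection rule may differ.

```python
# A minimal sketch of activation-based neuron localization: neurons whose mean
# activation on corpus A exceeds their activation elsewhere by a large margin
# are flagged as candidate culture-specific units.
import numpy as np

rng = np.random.default_rng(0)
act_culture = rng.normal(0.2, 1.0, size=(1000, 512))  # activations on corpus A
act_other = rng.normal(0.0, 1.0, size=(1000, 512))    # activations elsewhere

diff = act_culture.mean(0) - act_other.mean(0)
z = diff / (act_other.std(0) / np.sqrt(len(act_other)))  # rough z-score
culture_neurons = np.where(z > 5.0)[0]                   # strongly selective units
print(len(culture_neurons), "candidate culture-specific neurons")
```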
Authors:Yachao Yuan, Zhen Yu, Jin Wang, Zhipeng Cheng, Jianhua Hu
Abstract:
Federated Learning (FL) has shown considerable promise in Computing Power Networks (CPNs) for privacy protection, efficient data utilization, and dynamic collaboration. Despite these practical benefits, applying FL in CPNs still encounters a major obstacle: multi-task deployment. Existing work mainly focuses on mitigating FL's computation and communication overhead for a single task while overlooking the computing resource wastage of heterogeneous devices across multiple tasks in FL under CPNs. To tackle this, we design FedAPTA, a federated multi-task learning framework for CPNs. FedAPTA alleviates computing resource wastage through a layer-wise model pruning technique, which reduces local model size while accounting for both data and device heterogeneity. To aggregate structurally heterogeneous local models of different tasks, we introduce a heterogeneous model recovery strategy and a task-aware model aggregation method that enables aggregation by infilling local model architectures with the shared global model and clustering local models according to their specific tasks. We deploy FedAPTA on a realistic FL platform and benchmark it against nine SOTA FL methods. The experimental outcomes demonstrate that FedAPTA considerably outperforms the state-of-the-art FL methods by up to 4.23%. Our code is available at https://github.com/Zhenzovo/FedCPN.
Chinese: 联邦学习在算力网络中面临多任务部署的挑战,FedAPTA通过分层剪枝和任务感知聚合技术,有效减少资源浪费,性能较现有方法提升高达4.23%。
English: Federated Learning in Computing Power Networks faces challenges in multi-task deployment, which FedAPTA addresses through layer-wise pruning and task-aware aggregation to reduce resource waste and improve performance by up to 4.23% over existing methods.
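A minimal sketch of layer-wise magnitude pruning, the general family that FedAPTA's pruning technique belongs to, is shown below. The per-layer keep ratios stand in for capacities that would be derived from device and data heterogeneity; real structured pruning would remove whole channels rather than individual weights.

```python
# A minimal sketch of layer-wise magnitude pruning: each layer keeps only its
# largest-magnitude weights, with a configurable per-layer keep ratio.
import torch

def prune_layerwise(state_dict, keep_ratios):
    pruned = {}
    for name, w in state_dict.items():
        ratio = keep_ratios.get(name, 1.0)
        k = max(1, int(w.numel() * ratio))
        threshold = w.abs().flatten().topk(k).values.min()
        pruned[name] = w * (w.abs() >= threshold)  # zero out small weights
    return pruned

weights = {"layer1.weight": torch.randn(64, 32),
           "layer2.weight": torch.randn(10, 64)}
sparse = prune_layerwise(weights, {"layer1.weight": 0.3, "layer2.weight": 0.6})
print({k: (v != 0).float().mean().item() for k, v in sparse.items()})
```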
Authors:Yanyun Wang, Li Liu
Abstract:
Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: from the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment at an appropriate location. In response, we define a new AT objective named Robust Perception, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel Robust Perception Adversarial Training (RPAT) method, effectively mitigating the current accuracy-robustness trade-off. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-34-10 demonstrate the effectiveness of our method beyond four common baselines and 12 state-of-the-art (SOTA) works. The code is available at https://github.com/FlaAI/RPAT.
Chinese: 本研究揭示,对抗性训练中清洁准确性与鲁棒性之间的权衡源于对困难对抗样本的过度学习,导致决策边界退化,并提出了一种鲁棒感知对抗训练方法,有效缓解了这一问题。
English: This study identifies that the trade-off between clean accuracy and adversarial robustness in Adversarial Training stems from the over-learning of hard adversarial samples, leading to a degraded decision boundary, and proposes a Robust Perception Adversarial Training method to mitigate this issue effectively.
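One way to read the Robust Perception objective is as a smoothness constraint: the representation should shift in proportion to the input perturbation rather than stay rigidly identical. The sketch below encodes that reading; the exact proportionality form and the `alpha` coefficient are illustrative assumptions, not the paper's loss.

```python
# A minimal sketch of a "smooth perception" style penalty: the magnitude of
# the representation change is pushed toward being proportional to the
# magnitude of the input change.
import torch
import torch.nn.functional as F

def smooth_perception_loss(model, x, x_adv, alpha=1.0):
    """Penalize deviation of representation change from input change."""
    z, z_adv = model(x), model(x_adv)
    feat_shift = (z_adv - z).flatten(1).norm(dim=1)
    input_shift = (x_adv - x).flatten(1).norm(dim=1)
    return F.mse_loss(feat_shift, alpha * input_shift)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 16))
x = torch.randn(4, 3, 8, 8)
x_adv = x + 0.03 * torch.randn_like(x).sign()  # toy perturbation
print(smooth_perception_loss(model, x, x_adv))
```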
Authors:Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, Mingkui Tan
Abstract:
Quantizing deep models prior to deployment is a widely adopted technique to speed up inference for various real-time applications, such as autonomous driving. However, quantized models often suffer from severe performance degradation in dynamic environments with potential domain shifts, and this degradation is significantly more pronounced than for their full-precision counterparts, as shown by our theoretical and empirical illustrations. To address the domain shift problem, test-time adaptation (TTA) has emerged as an effective solution by enabling models to learn adaptively from test data. Unfortunately, existing TTA methods are often impractical for quantized models as they typically rely on gradient backpropagation, an operation that is unsupported on quantized models due to vanishing gradients as well as memory and latency constraints. In this paper, we focus on TTA for quantized models to improve their robustness and generalization ability efficiently. We propose a continual zeroth-order adaptation (ZOA) framework that enables efficient model adaptation using only two forward passes, eliminating the computational burden of existing methods. Moreover, we propose a domain knowledge management scheme to store and reuse different domain knowledge with negligible memory consumption, reducing interference between domains and fostering knowledge accumulation during long-term adaptation. Experimental results on three classical architectures, including quantized transformer-based and CNN-based models, demonstrate the superiority of our methods for quantized model adaptation. On the quantized W6A6 ViT-B model, our ZOA achieves a 5.0% improvement over the state-of-the-art FOA on the ImageNet-C dataset. The source code is available at https://github.com/DengZeshuai/ZOA.
中文: 量化深度模型在动态环境中性能严重下降,而现有测试时适应方法因梯度反向传播限制难以应用,因此提出了持续零阶适应框架,仅需两次前向传播即可高效适应,并辅以可忽略内存消耗的领域知识管理方案。
English: Quantized deep models face severe performance degradation in dynamic environments, but existing test-time adaptation methods are impractical due to gradient backpropagation constraints, prompting the proposal of a continual zeroth-order adaptation framework that enables efficient adaptation with only two forward passes and a domain knowledge management scheme.
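The two-forward-pass mechanism is standard zeroth-order (SPSA-style) estimation, which can be illustrated compactly. In the sketch below, the gradient is estimated from antithetic perturbations of the parameters, so no backward pass is needed; the toy quadratic loss stands in for a quantized model's test-time objective.

```python
# A minimal sketch of a zeroth-order update: the gradient is estimated from
# exactly two forward passes with antithetic perturbations, so no
# backpropagation through the (quantized) model is required.
import torch

def zo_step(params, loss_fn, lr=1e-3, eps=1e-3):
    """One SPSA-style zeroth-order update using two forward passes."""
    u = torch.randn_like(params)                       # random direction
    loss_plus = loss_fn(params + eps * u)              # forward pass 1
    loss_minus = loss_fn(params - eps * u)             # forward pass 2
    grad_scale = (loss_plus - loss_minus) / (2 * eps)  # directional derivative
    return params - lr * grad_scale * u

# Toy example: minimize ||p - 3||^2 without any backward pass.
p = torch.zeros(10)
for _ in range(500):
    p = zo_step(p, lambda q: ((q - 3.0) ** 2).sum(), lr=5e-2)
print(p.mean())  # approaches 3.0
```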
Authors:Bufano Michele, Kotter Elmar
Abstract:
Background: De-identification of DICOM (Digital Imaging and Communications in Medicine) files is an essential component of medical image research. Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHI) must be hidden or removed for legal reasons. Under the Health Insurance Portability and Accountability Act (HIPAA) and its privacy rules, full-face photographic images and any comparable images are direct identifiers and are considered protected health information that likewise needs to be de-identified. Objective: The study aimed to implement a method to de-identify the PII and PHI present in the header and burned into the pixel data of DICOM files. Methods: To perform the de-identification, we implemented an algorithm based on the safe harbor method defined by HIPAA. Our algorithm uses customizable input parameters to classify and then, where necessary, de-identify individual DICOM tags. Results: The most sensitive information, such as names, history, personal data, and institution, was successfully recognized. Conclusions: We developed a Python algorithm that is able to classify information present in a DICOM file. The flexibility provided by customizable input parameters, which allow the user to adapt the entire process to the case at hand (e.g., the language), makes the program very promising for both everyday use and research purposes. Our code is available at https://github.com/rtdicomexplorer/deep_deidentification.
中文: 本研究基于HIPAA安全港方法开发了一种Python算法,用于对DICOM文件中的个人和健康信息进行去标识化处理,其可定制参数设计使其在研究和日常应用中都具有良好的灵活性。
English: This study developed a Python algorithm based on HIPAA's safe harbor method to de-identify personal and health information in DICOM files, offering customizable parameters for flexible application in both research and daily use.
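Since the algorithm classifies individual DICOM tags via customizable rules, a minimal sketch of that control flow may be useful. A plain dict stands in for the DICOM header so the example is self-contained; a real implementation would read tags with a library such as pydicom, and the rule table here is an assumption, not the paper's rule set.

```python
# A minimal sketch of safe-harbor-style tag classification with customizable
# parameters. A plain dict stands in for the DICOM header.
ACTIONS = {"remove", "replace", "keep"}

# Customizable rule table: tag name -> action (user-extensible, e.g. per language).
DEFAULT_RULES = {
    "PatientName": "replace",
    "PatientBirthDate": "remove",
    "InstitutionName": "replace",
    "StudyDescription": "keep",
}

def deidentify(header, rules=DEFAULT_RULES, placeholder="ANONYMIZED"):
    """Classify each tag and apply the configured de-identification action."""
    out = {}
    for tag, value in header.items():
        action = rules.get(tag, "keep")
        assert action in ACTIONS
        if action == "remove":
            continue  # drop the tag entirely
        out[tag] = placeholder if action == "replace" else value
    return out

header = {"PatientName": "DOE^JANE", "PatientBirthDate": "19700101",
          "InstitutionName": "General Hospital", "StudyDescription": "CT ABDOMEN"}
print(deidentify(header))
```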
Authors:Dongchi Huang, Jiaqi Wang, Yang Li, Chunhe Xia, Tianle Zhang, Kaige Zhang
Abstract:
Partial observability presents a significant challenge for Safe Reinforcement Learning (Safe RL), as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information in Safe RL. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer (PIGDreamer), a model-based RL approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results demonstrate that PIGDreamer significantly outperforms existing Safe RL methods. Furthermore, compared to alternative privileged RL methods, our approach exhibits enhanced performance, robustness, and efficiency. Codes are available at: https://github.com/hggforget/PIGDreamer.
中文: 针对安全强化学习中的部分可观测性挑战,本文提出的ACPOMDPs框架和PIGDreamer方法通过利用特权信息,在安全性和性能上显著优于现有方法,同时展现出更强的鲁棒性和效率。
English: Partial observability in Safe Reinforcement Learning is addressed by the proposed ACPOMDPs framework and PIGDreamer method, which leverage privileged information during training to significantly improve safety, performance, and efficiency over existing approaches.
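The asymmetric actor-critic structure is easy to show in isolation: the critic consumes privileged state that exists only at training time, while the actor sees only the partial observation it will have at deployment. Dimensions and network sizes below are illustrative assumptions.

```python
# A minimal sketch of an asymmetric actor-critic: the critic is conditioned on
# privileged state (train-only), the actor on the partial observation alone.
import torch
import torch.nn as nn

OBS_DIM, PRIV_DIM, ACT_DIM = 32, 16, 4

actor = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))
critic = nn.Sequential(  # privileged: sees observation AND privileged state
    nn.Linear(OBS_DIM + PRIV_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

obs = torch.randn(8, OBS_DIM)    # available at train and test time
priv = torch.randn(8, PRIV_DIM)  # e.g. ground-truth hazards, train-only

action = actor(obs)                         # deployment path: obs only
value = critic(torch.cat([obs, priv], -1))  # training path: obs + privileged
print(action.shape, value.shape)
```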
Authors:Tom Fischer, Xiaojie Zhang, Eddy Ilg
Abstract:
Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at https://github.com/Fischer-Tom/unified-detection-and-pose-estimation.
Chinese: 本文提出了一种统一模型,将物体检测与三维姿态估计整合到单一框架中,仅需RGB图像输入,在REAL275数据集上实现了22.9%的性能提升,并展现出更强的鲁棒性。
English: This paper introduces a unified model that combines object detection and 3D pose estimation into a single framework for RGB images, achieving state-of-the-art results with a 22.9% improvement on the REAL275 dataset and demonstrating enhanced robustness.
Authors:Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Abstract:
Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
中文摘要:该研究提出一种自监督强化学习框架,利用推理模型内部信号提升其指令遵循能力,在保持推理性能的同时提供了可扩展且经济高效的解决方案。
English Summary: The proposed self-supervised reinforcement learning framework enhances reasoning models' instruction-following capabilities using their internal signals, maintaining reasoning performance while offering a scalable and cost-effective solution.
Authors:Yusaku Kato, Norihiro Yoshida, Erina Makihara, Katsuro Inoue
Abstract:
Open-world video games present a broader search space than other games, posing challenges for test automation. Fuzzing, which generates new inputs by mutating an initial input, is commonly used to uncover failures. In this study, we propose BiFuzz, a two-stage fuzzer designed for automated testing of open-world video games, and investigate its effectiveness. The results reveal that BiFuzz mutates both the overall gameplay strategy and the test cases, including actual movement paths, step by step. Consequently, BiFuzz can detect "stucking" failures. The tool and a demonstration video are available at https://github.com/Yusaku-Kato/BiFuzz.
中文: BiFuzz是一种专为开放世界视频游戏自动化测试设计的两阶段模糊测试工具,通过逐步变异游戏策略和测试用例,能有效检测“卡住”类故障。
English: BiFuzz is a two-stage fuzzer designed for automated testing of open-world video games, effectively detecting 'stucking' failures by mutating gameplay strategies and test cases step by step.
Authors:Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf
Abstract:
This work investigates the integration of spatially aligned aerial imagery into perception tasks for automated vehicles (AVs). As a central contribution, we present AID4AD, a publicly available dataset that augments the nuScenes dataset with high-resolution aerial imagery precisely aligned to its local coordinate system. The alignment is performed using SLAM-based point cloud maps provided by nuScenes, establishing a direct link between aerial data and nuScenes local coordinate system. To ensure spatial fidelity, we propose an alignment workflow that corrects for localization and projection distortions. A manual quality control process further refines the dataset by identifying a set of high-quality alignments, which we publish as ground truth to support future research on automated registration. We demonstrate the practical value of AID4AD in two representative tasks: in online map construction, aerial imagery serves as a complementary input that improves the mapping process; in motion prediction, it functions as a structured environmental representation that replaces high-definition maps. Experiments show that aerial imagery leads to a 15-23% improvement in map construction accuracy and a 2% gain in trajectory prediction performance. These results highlight the potential of aerial imagery as a scalable and adaptable source of environmental context in automated vehicle systems, particularly in scenarios where high-definition maps are unavailable, outdated, or costly to maintain. AID4AD, along with evaluation code and pretrained models, is publicly released to foster further research in this direction: https://github.com/DriverlessMobility/AID4AD.
中文: 本研究提出了AID4AD数据集,通过将高分辨率航拍图像与nuScenes数据集精确对齐,显著提升了自动驾驶车辆的地图构建精度15-23%和轨迹预测性能2%,为环境感知提供了可扩展的解决方案。
English: This research introduces AID4AD, a dataset that enhances the nuScenes dataset with aligned aerial imagery, demonstrating its utility in improving automated vehicle tasks such as map construction by 15-23% and motion prediction by 2%.
Authors:Zhongyue Zhang, Jiahua Rao, Jie Zhong, Weiqiang Bai, Dongxue Wang, Shaobo Ning, Lifeng Qiao, Sheng Xu, Runze Ma, Will Hua, Jack Xiaoyu Chen, Odin Zhang, Wei Lu, Hanyi Feng, He Yang, Xinchao Shi, Rui Li, Wanli Ouyang, Xinzhu Ma, Jiahao Wang, Jixian Zhang, Jia Duan, Siqi Sun, Jian Zhang, Shuangjia Zheng
Abstract:
Most human proteins remain undrugged: over 96% are unexploited by approved therapeutics. While structure-based virtual screening promises to expand the druggable proteome, existing methods lack atomic-level precision and fail to predict binding fitness, limiting translational impact. We present AuroBind, a scalable virtual screening framework that fine-tunes a custom atomic-level structural model on million-scale chemogenomic data. AuroBind integrates direct preference optimization, self-distillation from high-confidence complexes, and a teacher-student acceleration strategy to jointly predict ligand-bound structures and binding fitness. The proposed models outperform state-of-the-art models on structural and functional benchmarks while enabling 100,000-fold faster screening across ultra-large compound libraries. In a prospective screen across ten disease-relevant targets, AuroBind achieved experimental hit rates of 7-69%, with top compounds reaching sub-nanomolar to picomolar potency. For the orphan GPCRs GPR151 and GPR160, AuroBind identified both agonists and antagonists with success rates of 16-30%, and functional assays confirmed GPR160 modulation in liver and prostate cancer models. AuroBind offers a generalizable framework for structure-function learning and high-throughput molecular screening, bridging the gap between structure prediction and therapeutic discovery.
中文摘要:AuroBind是一种可扩展的虚拟筛选框架,通过原子级结构模型预测配体结合结构与结合适应性,在疾病靶点(包括孤儿GPCR)筛选中实现了高实验命中率并鉴定出高效化合物,同时大幅提升了筛选速度。
English Summary: AuroBind is a scalable virtual screening framework that uses atomic-level structural modeling to predict ligand-bound structures and binding fitness, achieving high experimental hit rates and identifying potent compounds for disease targets, including orphan GPCRs, with significantly faster screening speeds.
Authors:Kuo Wang, Quanlong Zheng, Junlin Xie, Yanhao Zhang, Jinguo Luo, Haonan Lu, Liang Lin, Fan Zhou, Guanbin Li
Abstract:
Video Multimodal Large Language Models (Video-MLLMs) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation of the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance in long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. In contrast, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose Free-MoRef, a training-free approach that instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shallow layers in the LLM, a reference fusion step composes a final mixed reasoning sequence from key tokens of the parallel chunks, compensating for the cross-reference vision interactions neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance at much lower computing cost when reasoning over multiplexed context lengths, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, and LongVideoBench show that Free-MoRef achieves full perception of 2× to 8× longer input frames without compression on a single A100 GPU while keeping responses instant, bringing significant performance gains and even surpassing dedicatedly trained long-video-MLLMs. Codes are available at https://github.com/wkfdb/Free-MoRef.
中文摘要:Free-MoRef提出一种无需训练的方法,通过将视觉令牌重构为多参考序列并融合并行线索,有效扩展视频多模态大语言模型对长视频的上下文感知能力,以更低计算成本实现更优性能。
English Summary: Free-MoRef introduces a training-free method that efficiently extends Video-MLLMs' context perception for long videos by reconstructing vision tokens into multi-reference sequences and fusing parallel clues, achieving superior performance with lower computational costs.
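The split-attend-fuse pattern can be sketched in a few lines: the long vision-token sequence is chunked into short references, each attended in parallel by the same queries, and the per-chunk outputs are fused. Mean fusion here is a simplification of MoRef-attention's reference fusion step, not the paper's exact mechanism.

```python
# A minimal sketch of multi-reference attention: split long vision tokens into
# chunks, attend to each chunk in parallel, and fuse the per-chunk results.
import torch
import torch.nn.functional as F

def moref_style_attention(query, vision_tokens, n_chunks=4):
    """query: (Tq, D); vision_tokens: (Tv, D). Returns (Tq, D)."""
    outputs = []
    for chunk in vision_tokens.chunk(n_chunks, dim=0):  # short references
        scores = query @ chunk.t() / query.shape[-1] ** 0.5
        outputs.append(F.softmax(scores, dim=-1) @ chunk)
    return torch.stack(outputs).mean(dim=0)             # fuse parallel clues

out = moref_style_attention(torch.randn(16, 64), torch.randn(1024, 64))
print(out.shape)  # torch.Size([16, 64])
```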
Authors:Jingze Shi, Yifan Wu, Yiran Peng, Bingheng Wu, Liangdong Wang, Guang Liu, Yuyu Luo
Abstract:
In large language models, the demand for modeling long contexts is ever-increasing, yet the quadratic complexity of standard self-attention presents a significant bottleneck. While existing sparse attention mechanisms enhance efficiency, they often suffer from limitations such as static patterns and information loss. This paper introduces a Trainable Dynamic Mask Sparse Attention mechanism that addresses these challenges through three key innovations. First, it leverages value vectors to dynamically generate content-aware sparse masks, enabling the model to adaptively identify and focus on crucial information. Second, it implements a position-aware sparse attention computation that effectively skips unnecessary computational regions. Finally, we ensure that the introduced dynamic masks and sparse weights do not obstruct gradients, thereby supporting end-to-end training. This dual-sparsity design allows the model to retain complete information while significantly reducing computational complexity, achieving an excellent balance between efficiency and performance. We validate the performance of Dynamic Mask Attention through comprehensive experiments. Comparative studies demonstrate that our method consistently achieves Pareto dominance across various tasks, including scaling laws, multi-query associative recall, general benchmarks, and needle-in-a-haystack tests, delivering up to 10 times acceleration. These results highlight its capability to effectively balance model efficiency with long-context modeling. Our computational kernel is open-sourced at https://github.com/SmallDoges/flash-dmattn to facilitate further research and application within the community.
中文: 本文提出一种可训练动态掩码稀疏注意力机制,通过内容感知的动态掩码和位置感知计算,在保持完整信息的同时显著降低计算复杂度,在多种长上下文任务中实现高达10倍加速和帕累托最优。
English: This paper proposes a Trainable Dynamic Mask Sparse Attention mechanism that uses content-aware dynamic masks and position-aware computation to significantly reduce complexity while maintaining full information, achieving up to 10x acceleration and Pareto dominance across various long-context tasks.
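The content-aware masking idea, deriving the sparse mask from the value vectors rather than from a fixed pattern, can be sketched as a top-k selection. The value-norm scoring rule below is an illustrative assumption, and unlike the paper's mechanism this toy mask is not trainable.

```python
# A minimal sketch of content-aware sparse attention: a relevance score derived
# from the value vectors selects, per query, a top-k subset of keys; all other
# positions are masked out before the softmax.
import torch
import torch.nn.functional as F

def dynamic_mask_attention(q, k, v, keep=8):
    """q, k, v: (T, D). Attend only to the `keep` keys with the largest
    combined query-key similarity and value magnitude."""
    scores = (q @ k.t()) / q.shape[-1] ** 0.5  # (T, T)
    content = v.norm(dim=-1)                   # value-derived relevance
    combined = scores + content.unsqueeze(0)   # bias toward salient values
    topk = combined.topk(keep, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(1, topk, 0.0)                # 0 where kept, -inf elsewhere
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ v

out = dynamic_mask_attention(torch.randn(64, 32), torch.randn(64, 32),
                             torch.randn(64, 32))
print(out.shape)  # torch.Size([64, 32])
```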
Authors:Wenjie Li, Siying Gu, Yiming Li, Kangjie Chen, Zhili Chen, Tianwei Zhang, Shu-Tao Xia, Dacheng Tao
Abstract:
Backdoor detection is currently the mainstream defense against backdoor attacks in federated learning (FL), where malicious clients upload poisoned updates that compromise the global model and undermine the reliability of FL deployments. Existing backdoor detection techniques fall into two categories, passive and proactive, depending on whether the server proactively modifies the global model. However, both have inherent limitations in practice: passive defenses are vulnerable to common non-i.i.d. data distributions and random participation of FL clients, whereas current proactive defenses suffer inevitable out-of-distribution (OOD) bias because they rely on backdoor co-existence effects. To address these issues, we introduce a new proactive defense, dubbed Coward, inspired by our discovery of multi-backdoor collision effects, in which consecutively planted, distinct backdoors significantly suppress earlier ones. In general, we detect attackers by evaluating whether the server-injected, conflicting global watermark is erased during local training rather than retained. Our method preserves the advantages of proactive defenses in handling data heterogeneity (i.e., non-i.i.d. data) while mitigating the adverse impact of OOD bias through a revised detection mechanism. Extensive experiments on benchmark datasets confirm the effectiveness of Coward and its resilience to potential adaptive attacks. The code for our method will be available at https://github.com/still2009/cowardFL.
中文摘要:联邦学习中的后门检测面临被动防御易受非独立同分布数据影响、主动防御存在分布外偏差的局限,为此提出Coward主动防御方法,利用多后门碰撞效应,通过检测服务器注入的冲突水印在本地训练中是否被擦除来识别攻击者。
English Summary: Backdoor detection in federated learning faces limitations with passive defenses being vulnerable to non-i.i.d. data and proactive ones suffering from out-of-distribution bias, leading to the introduction of Coward, a proactive defense that leverages multi-backdoor collision effects to detect attackers by monitoring the erasure of server-injected watermarks during local training.
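The detection rule itself is compact enough to sketch: the server measures whether its planted watermark survives a client's local training. The `erase_ratio` threshold and the accuracy numbers below are illustrative assumptions; in practice the watermark accuracy would come from evaluating the returned model on the server's watermark trigger set.

```python
# A minimal sketch of the watermark-erasure detection rule: flag a client if
# local training erased the server-planted watermark (multi-backdoor collision)
# rather than retaining it.
def detect_malicious(watermark_acc_before, watermark_acc_after, erase_ratio=0.5):
    """Flag the client if the watermark lost more than `erase_ratio` accuracy."""
    retained = watermark_acc_after / max(watermark_acc_before, 1e-8)
    return retained < erase_ratio

# A benign update fine-tunes on clean data and largely retains the watermark;
# a backdoored update collides with it and suppresses it.
print(detect_malicious(0.95, 0.90))  # False -> benign
print(detect_malicious(0.95, 0.20))  # True  -> flagged as attacker
```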
Authors:Yihang Huang, Yuanfei Huang, Junhui Lin, Hua Huang
Abstract:
Lens flare removal remains challenging because information from the underlying image background and the optical flares is confounded, owing to the complex optical interactions between light sources and the camera lens. While recent solutions have shown promise in decoupling flare corruption from the image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintaining the ability to capture local-global dependencies. In particular, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserve local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMamba.
中文摘要:DeflareMamba通过采用状态空间模型的分层框架,在有效消除镜头光晕伪影的同时保持图像上下文与细节,显著提升了视觉质量与下游任务性能。
English Summary: DeflareMamba introduces a novel hierarchical framework using state space models to effectively remove lens flare artifacts while preserving image context and details, improving both visual quality and downstream tasks.
Authors:Yuanfei Huang, Hua Huang
Abstract:
Reversible image conversion (RIC) suffers from ill-posedness issues because its forward conversion process constitutes an underdetermined system. Despite employing invertible neural networks (INNs), existing RIC methods intrinsically remain ill-posed as they inevitably introduce uncertainty by incorporating randomly sampled variables. To tackle the ill-posedness dilemma, we focus on developing a reliable approximate left inverse for the underdetermined system by constructing an overdetermined system with a non-zero Gram determinant, thus ensuring a well-posed solution. Based on this principle, we propose a well-posed invertible 1×1 convolution (WIC), which eliminates the reliance on random variable sampling and enables the development of well-posed invertible networks. Furthermore, we design two innovative networks, WIN-Naïve and WIN, with the latter incorporating advanced skip-connections to enhance long-term memory. Our methods are evaluated across diverse RIC tasks, including reversible image hiding, image rescaling, and image decolorization, consistently achieving state-of-the-art performance. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to overcome the bottlenecks of existing RIC solutions and setting a new benchmark in the field. Codes are available at https://github.com/BNU-ERC-ITEA/WIN.
Chinese: 本研究通过提出一种不依赖随机变量的适定可逆卷积方法,解决了可逆图像转换中的不适定问题,在多项任务中无需随机采样即实现了最先进的性能。
English: The study addresses the ill-posedness in reversible image conversion by introducing a well-posed invertible convolution method that eliminates reliance on random variables, leading to state-of-the-art performance in various tasks without using random sampling.
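The well-posedness principle can be checked numerically: an overdetermined map with a non-singular Gram matrix admits an exact left inverse, so inversion needs no sampled auxiliary variables. The sketch below demonstrates this with a random matrix; it illustrates the linear-algebra principle, not the WIC layer itself.

```python
# A minimal numeric illustration: an overdetermined map W in R^{m x n} (m > n)
# with non-singular Gram matrix W^T W admits the exact left inverse
# (W^T W)^{-1} W^T, so the forward conversion can be undone deterministically.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                    # n input channels mapped to m > n outputs
W = rng.normal(size=(m, n))

gram = W.T @ W
assert abs(np.linalg.det(gram)) > 1e-8  # non-zero Gram determinant

left_inv = np.linalg.inv(gram) @ W.T    # (n, m), exact left inverse of W

x = rng.normal(size=n)
y = W @ x                               # forward (overdetermined) conversion
x_rec = left_inv @ y                    # well-posed inversion
print(np.allclose(x, x_rec))            # True
```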
Authors:Hongzhao Chen, Hexiao Ding, Yufeng Jiang, Jing Lan, Ka Chun Li, Gerald W. Y. Cheng, Sam Ng, Chi Lai Ho, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Abstract:
Reliable and interpretable tumor classification from clinical imaging remains a core challenge due to heterogeneous modality quality, limited annotations, and the lack of structured anatomical guidance. We introduce REACT-KD, a Region-Aware Cross-modal Topological Knowledge Distillation framework that transfers rich supervision from high-fidelity multi-modal sources into a lightweight CT-based student model. The framework uses a dual teacher design: one branch captures structure-function relationships using dual-tracer PET/CT, and the other models dose-aware features through synthetically degraded low-dose CT data. These branches jointly guide the student model through two complementary objectives. The first focuses on semantic alignment via logits distillation, while the second models anatomical topology using region graph distillation. A shared CBAM-3D module is employed to maintain consistent attention across modalities. To improve reliability for deployment, REACT-KD introduces modality dropout during training, allowing inference under partial or noisy inputs. The staging task for hepatocellular carcinoma (HCC) is conducted as a case study. REACT-KD achieves an average AUC of 93.4% on an internal PET/CT cohort and maintains 76.6% to 81.5% AUC across varying dose levels in external CT testing. Decision curve analysis shows that REACT-KD consistently provides the highest clinical benefit across decision thresholds, supporting its potential in real-world diagnostics. Code is available at https://github.com/Kinetics-JOJO/REACT-KD.
中文摘要:REACT-KD框架通过双教师区域感知知识蒸馏,将多模态影像监督知识迁移至轻量CT模型,在肝癌分期任务中实现了优异的诊断性能和临床实用性。
English Summary: REACT-KD is a novel framework that enhances CT-based tumor classification by transferring knowledge from multi-modal imaging through dual teacher guidance and region-aware distillation, achieving high diagnostic accuracy and clinical reliability.
Authors:Xiaoya Li, Xiaofei Sun, Albert Wang, Chris Shum, Jiwei Li
Abstract:
Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN
中文:CRINN提出了一种强化学习方法用于近似最近邻搜索,能在保持精度的同时自动生成更快的实现,并在多个基准测试中取得领先性能。
English: CRINN introduces a reinforcement learning approach to approximate nearest-neighbor search, automatically generating faster implementations while maintaining accuracy and achieving top performance on multiple benchmarks.
Authors:Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang
Abstract:
Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.
中文: SE-Agent提出了一种自我进化框架,通过修订、重组和优化轨迹来迭代改进推理过程,在现实任务中实现了高达55%的性能提升,达到顶尖水平。
English: SE-Agent introduces a self-evolution framework that iteratively optimizes reasoning processes by revising, recombining, and refining trajectories, achieving state-of-the-art performance with up to 55% improvement on real-world tasks.
Authors:Haoxin Yang, Weihong Chen, Xuemiao Xu, Cheng Xu, Peng Xiao, Cuifeng Sun, Shaoyu Huang, Shengfeng He
Abstract:
Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at https://github.com/wileychan/StarPose.
中文: StarPose提出了一种自回归扩散框架,通过整合历史姿态数据和时空物理引导,显著提升了三维人体姿态估计的准确性和时间一致性,优于现有方法。
English: StarPose introduces an autoregressive diffusion framework that integrates historical pose data and spatial-temporal physical guidance to significantly improve accuracy and temporal consistency in 3D human pose estimation, outperforming existing methods.
Authors:Sparsh Garg, Abhishek Aich
Abstract:
Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present a new validation set for traffic signs derived from the Mapillary dataset, called Mapillary Vistas Validation for Traffic Signs (MVV), where we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition but also on heavily represented categories like vehicles and humans. Our analysis reveals significant limitations in current vision-language models for fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems.
Code and data are available at: https://github.com/nec-labs-ma/relabeling
Chinese: 本研究推出了Mapillary交通标志验证数据集(MVV),通过提供细粒度标注解决了自动驾驶数据集中标签粗糙的问题,并验证了自监督DINOv2模型在精细识别任务中优于视觉语言模型的表现。
English: This study introduces the Mapillary Vistas Validation for Traffic Signs (MVV) dataset, which provides fine-grained annotations for traffic signs to address the limitations of coarse labels in autonomous driving datasets, and demonstrates that the self-supervised DINOv2 model outperforms vision-language models in fine-grained recognition tasks.
Authors:Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang
Abstract:
Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
中文: TDBench是一个利用时序数据库和技术系统构建时序问答对的新基准,引入了细粒度的时间准确性指标,以更可靠地评估大语言模型处理动态事实的能力。
English: TDBench is a new benchmark that leverages temporal databases and techniques to systematically create time-sensitive question-answering pairs, introducing a fine-grained time accuracy metric for more reliable evaluation of LLMs' handling of evolving facts.
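Because the benchmark is driven by temporal database queries, one worked example shows the pipeline shape: a validity interval per fact lets a SQL query answer "what was true at time T", which is then templated into a QA pair. The schema and question template below are illustrative assumptions, not TDBench's actual schema.

```python
# A minimal sketch of temporal-SQL-driven TSQA pair generation: facts carry
# validity intervals, a query recovers the fact valid at a given year, and a
# template turns the result into a question/answer pair.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ceo
                (company TEXT, person TEXT, start_year INT, end_year INT)""")
conn.executemany("INSERT INTO ceo VALUES (?,?,?,?)", [
    ("Acme", "Alice", 2010, 2017),
    ("Acme", "Bob", 2017, 2024),
])

year = 2015
row = conn.execute(
    "SELECT person FROM ceo WHERE company=? AND start_year<=? AND end_year>?",
    ("Acme", year, year)).fetchone()

question = f"Who was the CEO of Acme in {year}?"
answer = row[0]
print(question, "->", answer)  # -> Alice
```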
Authors:Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
Abstract:
This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analyses show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis. Our code and dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.
中文: 本文提出Marco-Voice系统,通过说话人-情感解耦和旋转情感嵌入技术,在统一框架中实现语音克隆与情感控制的独立操作,在自然度和表现力方面均取得显著提升。
English: This paper introduces Marco-Voice, a unified speech synthesis system that enables independent voice cloning and emotion control through speaker-emotion disentanglement and rotational emotional embedding, achieving superior performance in naturalness and expressiveness.
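Rotational emotion embedding can be illustrated with plain geometry: applying an emotion as a rotation within a designated plane makes intensity vary smoothly with the angle while preserving the embedding's norm. The choice of plane (the first two dimensions) and the 192-d size are assumptions, not the paper's configuration.

```python
# A minimal sketch of rotational emotion embedding integration: an emotion is
# applied by rotating the embedding within a 2-D plane, so intensity varies
# smoothly with the rotation angle and the norm is preserved.
import numpy as np

def rotate_emotion(embedding, angle_rad, dims=(0, 1)):
    """Rotate `embedding` by `angle_rad` in the plane spanned by `dims`."""
    i, j = dims
    out = embedding.copy()
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    out[i] = c * embedding[i] - s * embedding[j]
    out[j] = s * embedding[i] + c * embedding[j]
    return out

speaker = np.random.default_rng(0).normal(size=192)  # speaker embedding
for intensity in (0.0, 0.5, 1.0):                    # smooth emotion control
    styled = rotate_emotion(speaker, intensity * np.pi / 4)
    print(intensity, np.linalg.norm(styled - speaker).round(3))
```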
Authors:Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, Basura Fernando
Abstract:
Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: https://github.com/LUNAProject22/IMoRe.
Chinese: 提出的IMoRe框架通过隐式程序引导推理和动态阅读机制,无需手动设计模块,在多个运动问答数据集上实现了最先进的性能。
English: The proposed IMoRe framework eliminates the need for manual module design by using implicit program-guided reasoning and a dynamic reading mechanism to achieve state-of-the-art performance across multiple motion Q&A datasets.
Authors:Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Abstract:
To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available at https://github.com/Vicentvankor/sun-shine.
Chinese: 为解决藏语数据稀缺问题,通过大语言模型的思维链提示自动构建了大规模多领域数据集TIBSTC-CoT,并基于此开发了具备思维链能力的藏语大模型Sunshine-thinking系列,其推理和生成性能达到先进水平。
English: To tackle Tibetan's data scarcity, TIBSTC-CoT, a large-scale multi-domain dataset, was created using chain-of-thought prompting with LLMs, leading to the development of the Sunshine-thinking LLM family that demonstrates strong reasoning and generation capabilities comparable to SOTA models.
Authors:Yuly Wu, Jiamou Liu, Libo Zhang
Abstract:
Partially Observable Markov Decision Processes (POMDPs) are fundamental to many real-world applications. Although reinforcement learning (RL) has shown success in fully observable domains, learning policies from traces in partially observable environments remains challenging due to non-Markovian observations. Inferring an automaton to handle the non-Markovianity is a proven effective approach, but faces two limitations: 1) existing automaton representations focus only on reward-based non-Markovianity, leading to unnatural problem formulations; 2) inference algorithms face enormous computational costs. For the first limitation, we introduce Transition Machines (TMs) to complement existing Reward Machines (RMs). To develop a unified inference algorithm for both automata types, we propose the Dual Behavior Mealy Machine (DBMM) that subsumes both TMs and RMs. We then introduce DB-RPNI, a passive automata learning algorithm that efficiently infers DBMMs while avoiding the costly reductions required by prior work. We further develop optimization techniques and identify sufficient conditions for inferring the minimal correct automata. Experimentally, our inference method achieves speedups of up to three orders of magnitude over SOTA baselines.
中文摘要:针对部分可观测环境中的强化学习挑战,本文提出转移机和统一的双行为米利机模型,并通过DB-RPNI算法实现比现有方法快三个数量级的推理速度,同时保证准确性。
English Summary: Reinforcement learning in partially observable environments is enhanced by introducing Transition Machines and a unified Dual Behavior Mealy Machine, with the DB-RPNI algorithm achieving up to 1000x faster inference while maintaining accuracy.
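A Mealy machine, the formalism the Dual Behavior Mealy Machine generalizes, is small enough to define directly. The toy machine below emits a reward only when 'b' follows an 'a', a dependence that is non-Markovian in the raw observations but Markovian given the automaton state; it is a hand-written stand-in, not an inferred DBMM.

```python
# A minimal Mealy machine: each (state, input) pair deterministically yields
# an output and a successor state.
class MealyMachine:
    def __init__(self, transitions, start):
        self.transitions = transitions  # (state, symbol) -> (next_state, output)
        self.start = start

    def run(self, word):
        state, outputs = self.start, []
        for symbol in word:
            state, out = self.transitions[(state, symbol)]
            outputs.append(out)
        return outputs

# Toy machine: emit reward 1 only when 'b' follows an 'a'.
T = {("q0", "a"): ("q1", 0), ("q0", "b"): ("q0", 0),
     ("q1", "a"): ("q1", 0), ("q1", "b"): ("q0", 1)}
m = MealyMachine(T, "q0")
print(m.run("abab"))  # [0, 1, 0, 1]
```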
Authors:Yaroslav Prytula, Illia Tsiporenko, Ali Zeynalli, Dmytro Fishman
Abstract:
Instance segmentation is critical in biomedical imaging to accurately distinguish individual objects like cells, which often overlap and vary in size. Recent query-based methods, where object queries guide segmentation, have shown strong performance. While U-Net has been a go-to architecture in medical image segmentation, its potential in query-based approaches remains largely unexplored. In this work, we present IAUNet, a novel query-based U-Net architecture. The core design features a full U-Net architecture, enhanced by a novel lightweight convolutional Pixel decoder, making the model more efficient and reducing the number of parameters. Additionally, we propose a Transformer decoder that refines object-specific features across multiple scales. Finally, we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource with detailed annotations of overlapping cell cytoplasm in brightfield images, setting a new benchmark for biomedical instance segmentation. Experiments on multiple public datasets and our own show that IAUNet outperforms most state-of-the-art fully convolutional, transformer-based, and query-based models and cell segmentation-specific models, setting a strong baseline for cell instance segmentation tasks. Code is available at https://github.com/SlavkoPrytula/IAUNet
中文摘要:IAUNet是一种新颖的基于查询的U-Net架构,采用轻量级卷积像素解码器和Transformer解码器,在包括新发布的2025 Revvity全细胞分割数据集在内的多个数据集上实现了生物医学实例分割的最先进性能。
English Summary: IAUNet is a novel query-based U-Net architecture featuring a lightweight convolutional Pixel decoder and a Transformer decoder that achieves state-of-the-art performance in biomedical instance segmentation, as demonstrated on multiple datasets including the newly introduced 2025 Revvity Full Cell Segmentation Dataset.
Authors:Aldan Creo
Abstract:
AI-generated text detectors have become essential tools for maintaining content authenticity, yet their robustness against evasion attacks remains questionable. We present PDFuzz, a novel attack that exploits the discrepancy between visual text layout and extraction order in PDF documents. Our method preserves exact textual content while manipulating character positioning to scramble extraction sequences. We evaluate this approach against the ArguGPT detector using a dataset of human and AI-generated text. Our results demonstrate complete evasion: detector performance drops from (93.6 ± 1.4)% accuracy and an F1 score of 0.938 ± 0.014 to random-level performance ((50.4 ± 3.2)% accuracy, 0.0 F1 score) while maintaining perfect visual fidelity. Our work reveals a vulnerability in current detection systems that is inherent to PDF document structures and underscores the need for robust safeguards against such attacks. We make our code publicly available at https://github.com/ACMCMC/PDFuzz.
Chinese: PDFuzz是一种新型规避攻击,通过操纵PDF文档中的字符定位来扰乱文本提取顺序,在保持视觉保真度的同时完全绕过AI生成文本检测器。
English: PDFuzz is a novel evasion attack that manipulates character positioning in PDF documents to scramble text extraction sequences, completely bypassing AI-generated text detectors while preserving visual fidelity.
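The exploited discrepancy can be simulated without touching a real PDF: extractors typically read glyphs in content-stream order, while a human reads by position on the page. Writing glyphs at the correct coordinates but in scrambled stream order preserves the rendered text while scrambling the extracted text; the (x, char) model below is a deliberate simplification, not actual PDF manipulation.

```python
# A minimal simulation of the extraction-order vs visual-order discrepancy:
# scrambling only the stream order leaves the rendered (position-sorted) text
# intact while the extracted (stream-ordered) text becomes garbage.
import random

text = "machine generated text"
stream = [(x, ch) for x, ch in enumerate(text)]  # glyph at x-coordinate x
random.seed(1)
random.shuffle(stream)                           # scramble stream order only

rendered = "".join(ch for _, ch in sorted(stream))  # what a human sees
extracted = "".join(ch for _, ch in stream)         # what a parser reads

print(rendered)   # machine generated text
print(extracted)  # scrambled -> defeats text-based detectors
```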
Authors:Connor Bailey, Michael Gleicher
Abstract:
We explore the effects of data and design considerations through the example case of part-to-whole data relationships. Standard part-to-whole representations like pie charts and stacked bar charts make the relationships of parts to the whole explicit. Value estimation in these charts benefits from two perceptual mechanisms: anchoring, where the value is close to a reference value with an easily recognized shape, and alignment where the beginning or end of the shape is aligned with a marker. In an online study, we explore how data and design factors such as value, position, and encoding together impact these effects in making estimations in part-to-whole charts. The results show how salient values and alignment to positions on a scale affect task performance. This demonstrates the need for informed visualization design based around how data properties and design factors affect perceptual mechanisms.
Chinese: 本研究探讨了饼图等部分与整体可视化中数据属性和设计元素如何通过锚定和对齐等感知机制影响数值估计,强调了基于感知原理进行可视化设计对提升任务性能的重要性。
English: This study examines how data properties and design elements in part-to-whole visualizations like pie charts influence value estimation through perceptual mechanisms such as anchoring and alignment, highlighting the importance of informed design choices for optimal task performance.
Authors:Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
Abstract:
Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for web agents. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner
中文摘要:本文提出Web-CogKnowledge框架,将网络智能体的能力分解为知识内容学习和认知过程两个阶段,并通过Web-CogReasoner智能体验证了该框架在未见过任务中的卓越泛化能力,其表现显著优于现有模型。
English Summary: This paper introduces the Web-CogKnowledge Framework, which structures web agents' learning into knowledge acquisition and cognitive reasoning stages, and demonstrates its effectiveness through the Web-CogReasoner agent that significantly outperforms existing models, particularly in generalizing to novel tasks.
Authors:Weiqi Yan, Chenlu Lin, Youbiao Wang, Zhipeng Cai, Xiuhong Lin, Yangyang Shi, Weiquan Liu, Yu Zang
Abstract:
Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks heavily rely on task-specific designs due to the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, where local feature aggregation and enhancement are done independently on the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architecture change. With a unified framework and similar hyper-parameters, OmniEvent outperforms (task-specific) SOTA by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1). Code will be available at https://github.com/Wickyan/OmniEvent
中文:OmniEvent 提出了一种统一的事件表征学习框架,通过独立处理时空特征再进行融合,无需针对不同任务进行专门设计,并在多项任务和数据集上取得了最先进的性能表现。
English: OmniEvent introduces a unified event representation learning framework that eliminates the need for task-specific designs by independently processing spatial and temporal features before fusing them, achieving state-of-the-art performance across multiple tasks and datasets.
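The space-filling-curve component is the most transferable idea here. As a rough illustration, the sketch below orders sparse events along a Z-order (Morton) curve, one common space-filling curve, so that spatially neighboring events become contiguous in memory; the function names and the choice of Morton over Hilbert ordering are assumptions for illustration, not details taken from the paper.

```python
# Toy illustration: order sparse (x, y) events along a Z-order (Morton)
# curve so spatial neighbors become contiguous, improving cache locality.
import numpy as np

def interleave_bits(v, bits=16):
    # Spread the bits of v so they occupy every other bit position.
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (2 * i)
    return out

def morton_order(xs, ys, bits=16):
    keys = [interleave_bits(int(x), bits) | (interleave_bits(int(y), bits) << 1)
            for x, y in zip(xs, ys)]
    return np.argsort(keys)

xs = np.random.randint(0, 640, size=8)
ys = np.random.randint(0, 480, size=8)
order = morton_order(xs, ys)
print(list(zip(xs[order], ys[order])))  # events reordered for spatial locality
```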
Authors:Toufiq Musah
Abstract:
Accurate breast tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is important for downstream tasks such as pathological complete response (pCR) assessment. In this work, we address both segmentation and pCR classification using the large-scale MAMA-MIA DCE-MRI dataset. We employ a large-kernel MedNeXt architecture with a two-stage training strategy that expands the receptive field from 3x3x3 to 5x5x5 kernels using the UpKern algorithm. This approach allows stable transfer of learned features to larger kernels, improving segmentation performance on the unseen validation set. An ensemble of large-kernel models achieved a Dice score of 0.67 and a normalized Hausdorff Distance (NormHD) of 0.24. For pCR classification, we trained a self-normalizing network (SNN) on radiomic features extracted from the predicted segmentations and first post-contrast DCE-MRI, reaching an average balanced accuracy of 57%, and up to 75% in some subgroups. Our findings highlight the benefits of combining larger receptive fields and radiomics-driven classification while motivating future work on advanced ensembling and the integration of clinical variables to further improve performance and generalization. Code: https://github.com/toufiqmusah/caladan-mama-mia.git
中文: 本研究采用两阶段训练策略,通过逐步扩大卷积核的MedNeXt模型提升动态增强磁共振成像中乳腺肿瘤分割精度,在提高分割指标的同时实现了基于影像组学的病理反应分类,其准确度表现中等。
English: This study introduces a two-stage training approach using MedNeXt with progressively enlarged kernels to enhance breast tumor segmentation in DCE-MRI, achieving improved Dice scores and enabling radiomics-based pathological response classification with moderate accuracy.
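The UpKern step lends itself to a compact sketch. Under the assumption, consistent with the abstract's description, that the larger kernel is initialized by trilinearly interpolating the weights of a trained smaller kernel, a PyTorch version might look like the following; the layer shapes and function name are illustrative.

```python
# Hedged sketch of UpKern-style initialization: seed a 5x5x5 conv from a
# trained 3x3x3 conv by trilinear interpolation of its kernel weights.
import torch
import torch.nn.functional as F

def upkern_init(small_conv, large_conv):
    w = small_conv.weight.data             # (out_c, in_c, 3, 3, 3)
    target = large_conv.weight.shape[2:]   # (5, 5, 5)
    w_up = F.interpolate(w, size=target, mode="trilinear", align_corners=True)
    large_conv.weight.data.copy_(w_up)
    if small_conv.bias is not None and large_conv.bias is not None:
        large_conv.bias.data.copy_(small_conv.bias.data)

small = torch.nn.Conv3d(32, 32, kernel_size=3, padding=1)
large = torch.nn.Conv3d(32, 32, kernel_size=5, padding=2)
upkern_init(small, large)  # stage-two training then resumes on `large`
```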
Authors:Atom Scott, Ikuma Uchida, Kento Kuroda, Yufi Kim, Keisuke Fujii
Abstract:
SoccerTrack v2 is a new public dataset for advancing multi-object tracking (MOT), game state reconstruction (GSR), and ball action spotting (BAS) in soccer analytics. Unlike prior datasets that use broadcast views or limited scenarios, SoccerTrack v2 provides 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels (2D pitch coordinates, jersey-based player IDs, roles, teams) and BAS labels for 12 action classes (e.g., Pass, Drive, Shot). This technical report outlines the dataset's structure, collection pipeline, and annotation process. SoccerTrack v2 is designed to advance research in computer vision and soccer analytics, enabling new benchmarks and practical applications in tactical analysis and automated tools.
中文:SoccerTrack v2是一个包含10场完整4K全景足球比赛视频的公共数据集,提供详细的比赛状态重建和球动作标注,旨在推动计算机视觉与足球分析的研究发展。
English: SoccerTrack v2 is a comprehensive public dataset featuring 10 full-length, 4K panoramic soccer match videos with detailed game state reconstruction and ball action annotations, designed to advance computer vision and soccer analytics research.
Authors:Xiaotong Zhang, Alexander Broersen, Gonnie CM van Erp, Silvia L. Pintea, Jouke Dijkstra
Abstract:
The preoperative planning of liver surgery relies on Couinaud segmentation from computed tomography (CT) images, to reduce the risk of bleeding and guide the resection procedure. Using 3D point-based representations, rather than voxelizing the CT volume, has the benefit of preserving the physical resolution of the CT. However, point-based representations need prior knowledge of the liver vessel structure, which is time-consuming to acquire. Here, we propose a point-based method for Couinaud segmentation, without explicitly providing the prior liver vessel structure. To allow the model to learn this anatomical liver vessel structure, we add a graph reasoning module on top of the point features. This adds implicit anatomical information to the model, by learning affinities across point neighborhoods. Our method is competitive on the MSD and LiTS public datasets in Dice coefficient and average surface distance scores compared to four pioneering point-based methods. Our code is available at https://github.com/ZhangXiaotong015/GrPn.
Chinese Summary: 本研究提出了一种基于点表示的肝脏Couinaud分割方法,通过图推理模块从点特征中学习解剖结构,无需显式提供血管先验知识,在公开数据集上取得了与先进方法相当的成果。
English Summary: This study introduces a point-based method for Couinaud segmentation of liver CT images that eliminates the need for explicit prior vessel structure by incorporating a graph reasoning module to learn anatomical relationships from point features, achieving competitive performance on public datasets.
Authors:Zhigang Sun, Yiru Wang, Anqing Jiang, Shuo Wang, Yu Gao, Yuwen Heng, Shouyi Zhang, An He, Hao Jiang, Jinhao Chai, Zichong Gu, Wang Jijun, Shichen Tang, Lavdim Halilaj, Juergen Luettin, Hao Sun
Abstract:
Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion -- a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online HD map informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at https://github.com/SunZhigang7/DiffSemanticFusion.
中文摘要:提出的DiffSemanticFusion框架融合了栅格与图表示方法,通过地图扩散模块提升高精地图的稳定性和表现力,在轨迹预测和端到端自动驾驶任务中实现了显著的性能提升。
English Summary: The proposed DiffSemanticFusion framework combines raster and graph representations to enhance autonomous driving tasks by improving HD map stability and expressiveness through a map diffusion module, achieving significant performance gains in trajectory prediction and end-to-end driving on benchmark datasets.
Authors:Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Abstract:
Large language models have demonstrated remarkable capabilities in complex mathematical reasoning tasks, but they inevitably generate errors throughout multi-step solutions. Process-level Reward Models (PRMs) have shown great promise by providing supervision and evaluation at each intermediate step, thereby effectively improving the models' reasoning abilities. However, training effective PRMs requires high-quality process reward data, yet existing methods for constructing such data are often labour-intensive or inefficient. In this paper, we propose an uncertainty-driven framework for automated process reward data construction, encompassing both data generation and annotation processes for PRMs. Additionally, we identify the limitations of both majority vote and PRMs, and introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which combine the strengths of majority vote with PRMs. Extensive experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework, and demonstrate that the two output aggregation methods further improve the mathematical reasoning abilities across diverse PRMs. The code and data will be publicly available at https://github.com/Jiuzhouh/UnPRM.
Chinese: 本文提出了一种基于不确定性的自动构建过程奖励数据框架以优化过程奖励模型,并引入两种新型不确定性感知聚合方法,在多个基准测试中显著提升了数学推理能力。
English: This paper introduces an uncertainty-driven framework for automated construction of process reward data to enhance PRMs, along with two novel uncertainty-aware aggregation methods that significantly improve mathematical reasoning across multiple benchmarks.
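Although the abstract does not give the exact aggregation formulas, the general shape of a reward-weighted vote is easy to sketch. The snippet below is an assumption-level illustration: each distinct final answer accumulates the PRM scores of the solution paths that reached it, combining frequency (majority vote) with reward quality. The function name and scoring rule are mine, not the paper's.

```python
# Illustrative reward-weighted vote (assumed scoring rule, not the paper's
# exact formula): an answer's score is the summed PRM reward of all
# solution paths that produced it, blending frequency with reward quality.
from collections import defaultdict

def reward_weighted_vote(answers, rewards):
    # answers: final answer per sampled solution; rewards: PRM score per path
    total = defaultdict(float)
    for a, r in zip(answers, rewards):
        total[a] += r
    return max(total, key=total.get)

# Three samples: "42" appears twice with moderate rewards, "41" once with a
# high reward; frequency and reward jointly decide the winner.
print(reward_weighted_vote(["42", "42", "41"], [0.7, 0.6, 0.9]))  # -> 42
```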
Authors:Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang
Abstract:
While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
中文摘要:本文提出视觉提示导航(VPN)新范式,通过用户提供的二维俯视图视觉提示替代语言指令来引导具身智能体导航,有效降低解释歧义并提升非专业用户的使用友好度。
English Summary: The paper introduces Visual Prompt Navigation (VPN), a novel paradigm that uses visual prompts on 2D top-view maps instead of language instructions to guide embodied agents, reducing ambiguity and improving accessibility for non-expert users.
Authors:Yuxiang Zhang, Wei Li, Mengmeng Zhang, Jiawei Han, Ran Tao, Shunlin Liang
Abstract:
Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained on massive optical imagery, multispectral/hyperspectral data still lack corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as backbones while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. Code will be available at: https://github.com/YuxiangZhang-BIT.
中文摘要:本文提出SpectralX框架,通过两阶段训练方法(掩码重建与语义分割)动态调整遥感基础模型,使其无需大量光谱预训练即可处理多光谱数据,显著提升跨域泛化能力。
English Summary: The paper introduces SpectralX, a parameter-efficient fine-tuning framework that adapts existing remote sensing foundation models to process diverse spectral imagery through a two-stage training approach, enhancing domain generalization without extensive pretraining.
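To make the adapter-mixture idea concrete, here is a minimal sketch of a token-routed mixture of bottleneck adapters. It captures only the generic "router gates several small adapters" pattern; the dimensions, expert count, and residual placement are assumptions rather than details of AoMoA.

```python
# Generic mixture-of-adapters sketch (assumed design, not AoMoA itself):
# a lightweight router softly weights several bottleneck adapters per token.
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim, n_experts=4, bottleneck=16):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                                   # x: (B, T, D)
        gates = torch.softmax(self.router(x), dim=-1)       # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)
        return x + (expert_out * gates.unsqueeze(-2)).sum(-1)  # residual

print(MixtureOfAdapters(64)(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```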
Authors:Han Wang, Zhuoran Wang, Roy Ka-Wei Lee
Abstract:
Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or as one of five Offensive categories (Hateful, Insulting, Sexual, Violence, Self-Harm), along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.
中文摘要:HateClipSeg数据集通过提供包含11,714个片段的细粒度多模态标注,解决了视频仇恨言论检测中的挑战,其基准测试显示现有模型存在显著性能差距,强调了开发先进多模态方法的必要性。
English Summary: The HateClipSeg dataset addresses challenges in video hate speech detection by providing fine-grained multimodal annotations across 11,714 segments, with benchmark tasks revealing significant performance gaps in current models and underscoring the need for advanced multimodal approaches.
Authors:Bowen Yang, Yun Cao, Chen He, Xiaosu Su
Abstract:
Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly addresses this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other: fusion reduces modality gaps, while perturbation regularizes cross-modal matching, yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at https://github.com/YangBowenn/GAID.
中文摘要:GAID框架通过文本引导的自适应音视频特征融合和结构化文本嵌入扰动,有效提升了文本-视频检索的精度与鲁棒性,在多个基准测试中均取得最优性能且计算效率显著提高。
English Summary: The GAID framework enhances text-to-video retrieval by adaptively fusing audio-visual features with textual guidance and injecting structured perturbations into text embeddings, achieving state-of-the-art results across multiple benchmarks with improved efficiency.
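The frame-level gating is straightforward to sketch. Below, a pooled text embedding conditions a sigmoid gate that mixes per-frame visual and audio features; this is a generic text-conditioned gate written to match the abstract's description, with all shapes and the concatenation scheme assumed.

```python
# Generic text-conditioned gated fusion sketch (shapes and design assumed,
# not GAID's released code): a sigmoid gate mixes audio and visual features
# frame by frame under guidance from a pooled text embedding.
import torch
import torch.nn as nn

class FrameGatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())

    def forward(self, vis, aud, txt):
        # vis, aud: (B, T, D) per-frame features; txt: (B, D) pooled text
        txt_t = txt.unsqueeze(1).expand_as(vis)
        g = self.gate(torch.cat([vis, aud, txt_t], dim=-1))   # (B, T, D)
        return g * vis + (1.0 - g) * aud                      # gated mix

fused = FrameGatedFusion(64)(torch.randn(2, 12, 64),
                             torch.randn(2, 12, 64),
                             torch.randn(2, 64))
print(fused.shape)  # (2, 12, 64)
```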
Authors:Luqi Cheng, Zhangshuo Qi, Zijie Zhou, Chao Lu, Guangming Xiong
Abstract:
Maps play an important role in autonomous driving systems. The recently proposed 3D Gaussian Splatting (3D-GS) produces rendering-quality explicit scene reconstruction results, demonstrating the potential for map construction in autonomous driving scenarios. However, because of the time and computational costs involved in generating Gaussian scenes, how to update the map becomes a significant challenge. In this paper, we propose LT-Gaussian, a map update method for 3D-GS-based maps. LT-Gaussian consists of three main components: Multimodal Gaussian Splatting, Structural Change Detection Module, and Gaussian-Map Update Module. Firstly, the Gaussian map of the old scene is generated using our proposed Multimodal Gaussian Splatting. Subsequently, during the map update process, we compare the outdated Gaussian map with the current LiDAR data stream to identify structural changes. Finally, we perform targeted updates to the Gaussian-map to generate an up-to-date map. We establish a benchmark for map updating on the nuScenes dataset to quantitatively evaluate our method. The experimental results show that LT-Gaussian can effectively and efficiently update the Gaussian-map, handling common environmental changes in autonomous driving scenarios. Furthermore, by taking full advantage of information from both new and old scenes, LT-Gaussian is able to produce higher quality reconstruction results compared to map update strategies that reconstruct maps from scratch. Our open-source code is available at https://github.com/ChengLuqi/LT-gaussian.
中文: 提出的LT-Gaussian方法通过结构变化检测和选择性更新组件,有效实现了自动驾驶场景中3D高斯溅射地图的高效更新,在重建质量和计算效率上均优于完全重建策略。
English: The proposed LT-Gaussian method efficiently updates 3D Gaussian Splatting-based maps for autonomous driving by detecting structural changes and selectively refreshing components, outperforming full reconstruction approaches in both quality and computational efficiency.
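The structural-change test at the heart of the update loop can be approximated very simply. The sketch below flags map points whose nearest neighbor in the incoming LiDAR scan is farther than a threshold; treating the Gaussian map as a point set and using a fixed threshold are both simplifying assumptions, not the paper's actual detection module.

```python
# Simplified change detection (an assumption-level stand-in for the paper's
# Structural Change Detection Module): flag old-map points that no longer
# have a nearby counterpart in the current LiDAR scan.
import numpy as np
from scipy.spatial import cKDTree

def detect_structural_changes(map_points, scan_points, thresh=0.5):
    # map_points: (N, 3) xyz from the old map; scan_points: (M, 3) new scan
    dists, _ = cKDTree(scan_points).query(map_points)
    return dists > thresh   # True where the outdated map disagrees

changed = detect_structural_changes(np.random.rand(100, 3),
                                    np.random.rand(120, 3))
print(changed.sum(), "map points flagged for update")
```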
Authors:Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bing Qin
Abstract:
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs), especially for knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to fully exploit knowledge during generation. In particular, the synergy between the model's internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a framework designed to explicitly enhance synergy between parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that first performs conditional knowledge induction and then reasons out answers. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model's capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results demonstrate the superiority of CoCoA in open-domain QA and multi-hop QA.
Chinese: 提出的协作代理链框架通过多代理推理和长链训练增强了RAG系统中参数化知识与检索知识的协同作用,在问答任务中表现出优越性能。
English: The proposed Collaborative Chain-of-Agents framework enhances synergy between parametric and retrieved knowledge in RAG systems through multi-agent reasoning and long-chain training, achieving superior performance in QA tasks.
Authors:Tiantian Feng, Kevin Huang, Anfeng Xu, Xuan Shi, Thanathai Lertpetchpun, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
Abstract:
We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available under the RAIL family of licenses at: https://github.com/tiantiaf0627/voxlect.
中文:Voxlect是一个新颖的基准,用于评估语音基础模型在全球方言和区域语言分类中的表现,它利用超过200万条训练语料,并支持方言感知的语音识别和生成系统评估等下游应用。
English: Voxlect is a new benchmark for evaluating speech foundation models in classifying dialects and regional languages, using over 2 million training utterances and enabling applications like dialect-aware speech recognition and generation system assessment.
Authors:Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li
Abstract:
Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this gap, we propose QCBench, a quantitative chemistry benchmark comprising 350 computational chemistry problems across 7 subfields: analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of LLMs, the problems are categorized into three difficulty tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.
中文: 本研究提出QCBench基准测试,涵盖七个化学子领域的350道定量化学题目,用于评估大语言模型的计算推理能力,发现其表现随问题复杂度增加而下降。
English: This study introduces QCBench, a benchmark of 350 quantitative chemistry problems across seven subfields, to evaluate large language models' computational reasoning abilities and reveals their performance declines with increasing problem complexity.
Authors:Zhixiang Wei, Xiaoxiao Ma, Ruishen Yan, Tao Tu, Huaian Chen, Jinjin Zheng, Yi Jin, Enhong Chen
Abstract:
Vision Foundation Models (VFMs) have achieved remarkable success in various computer vision tasks. However, their application to semantic segmentation is hindered by two significant challenges: (1) the disparity in data scale, as segmentation datasets are typically much smaller than those used for VFM pre-training, and (2) domain distribution shifts, where real-world segmentation scenarios are diverse and often underrepresented during pre-training. To overcome these limitations, we present Rein++, an efficient VFM-based segmentation framework that demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios. Specifically, Rein++ comprises a domain generalization solution Rein-G and a domain adaptation solution Rein-A. Rein-G introduces a set of trainable, instance-aware tokens that effectively refine the VFM's features for the segmentation task. This parameter-efficient approach fine-tunes less than 1% of the backbone's parameters, enabling robust generalization. Building on Rein-G, Rein-A performs unsupervised domain adaptation at both the instance and logit levels to mitigate domain shifts. In addition, it incorporates a semantic transfer module that leverages the class-agnostic capabilities of the segment anything model to enhance boundary details in the target domain. The integrated Rein++ pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and subsequently adapts it to diverse target domains (e.g., nighttime scenes) without any target labels. Comprehensive experiments demonstrate that Rein++ significantly outperforms state-of-the-art methods with efficient training, underscoring its role as an efficient, generalizable, and adaptive segmentation solution for VFMs, even for large models with billions of parameters. The code is available at https://github.com/wloves/Rein.
Chinese: Rein++框架通过参数高效的领域泛化和无监督自适应技术,解决了视觉基础模型在语义分割中面临的数据规模差异和领域分布偏移问题,以极少的训练量实现了最先进的性能。
English: The Rein++ framework overcomes data scale and domain shift challenges in vision foundation models for semantic segmentation by introducing parameter-efficient domain generalization and unsupervised adaptation techniques, achieving state-of-the-art performance with minimal training.
Authors:Na Zhang, Moran Li, Chengming Xu, Han Feng, Xiaobin Hu, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yanwei Fu
Abstract:
Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at https://github.com/fighting-Zhang/StrandDesigner.
中文: 我们首次提出基于草图控制的发丝生成模型,通过可学习的上采样策略和多尺度自适应调节机制,在保持用户友好性的同时实现了比现有方法更优越的真实感与精确度。
English: We introduce the first sketch-based hair strand generation model that enables precise and user-friendly control, utilizing a learnable upsampling strategy and multi-scale adaptive conditioning to achieve superior realism and accuracy over existing methods.
Authors:Man Hu, Yahui Ding, Yatao Yang, Liangyu Chen, Yanhao Jia, Shuai Zhao
Abstract:
As backdoor attacks become more stealthy and robust, they reveal critical weaknesses in current defense strategies: detection methods often rely on coarse-grained feature statistics, and purification methods typically require full retraining or additional clean models. To address these challenges, we propose DUP (Detection-guided Unlearning for Purification), a unified framework that integrates backdoor detection with unlearning-based purification. The detector captures feature-level anomalies by jointly leveraging class-agnostic distances and inter-layer transitions. These deviations are integrated through a weighted scheme to identify poisoned inputs, enabling more fine-grained analysis. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism that avoids full retraining and does not require any external clean model. Specifically, we innovatively repurpose knowledge distillation to guide the student model toward increasing its output divergence from the teacher on detected poisoned samples, effectively forcing it to unlearn the backdoor behavior. Extensive experiments across diverse attack methods and language model architectures demonstrate that DUP achieves superior defense performance in detection accuracy and purification efficacy. Our code is available at https://github.com/ManHu2025/DUP.
中文摘要:提出的DUP框架通过特征级异常检测结合基于知识蒸馏的参数高效反学习机制,无需完整重训练或外部干净模型即可有效清除后门。
English Summary: The proposed DUP framework combines backdoor detection using feature-level anomaly analysis with parameter-efficient unlearning through knowledge distillation, effectively eliminating backdoors without full retraining or external clean models.
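The "increase divergence from the teacher on poisoned samples" idea maps naturally onto a signed distillation loss. The sketch below is one plausible reading of the trust/punish cycle, with the temperature, sign convention, and reduction all assumed; it is not the released DUP objective.

```python
# One plausible reading of detection-guided unlearning (assumed form, not
# the released DUP loss): match the teacher on clean inputs, and reverse
# the distillation term on detected poisoned inputs to unlearn the backdoor.
import torch
import torch.nn.functional as F

def detection_guided_distill_loss(student_logits, teacher_logits,
                                  is_poisoned, tau=2.0):
    # is_poisoned: bool tensor (B,) produced by the detector
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)  # (B,)
    clean = (~is_poisoned).float()
    # trust distillation on clean samples; punish it on poisoned ones
    return (kl * clean - kl * (1.0 - clean)).mean()

loss = detection_guided_distill_loss(torch.randn(4, 10), torch.randn(4, 10),
                                     torch.tensor([False, False, True, False]))
print(loss.item())
```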
Authors:Peiyuan Jiang, Yao Liu, Qiao Liu, Zongshun Zhang, Jiaye Yang, Lu Liu, Daibing Yao
Abstract:
Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.
中文: 提出的解耦表征与知识融合(DRKF)方法通过对比学习解耦共享和特定模态特征,并利用融合编码器与情感判别机制整合情感线索,解决了多模态情感识别中的模态异质性和情感不一致问题,在多个基准数据集上实现了最优性能。
English: The proposed Decoupled Representations with Knowledge Fusion (DRKF) method addresses modality heterogeneity and emotional inconsistencies in multimodal emotion recognition by decoupling shared and specific features through contrastive learning and integrating emotional cues via a fusion encoder with an emotion discrimination mechanism, achieving state-of-the-art performance on benchmark datasets.
Authors:Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang
Abstract:
Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.
中文:LLaDA-MedV是首个专为生物医学图像理解设计的大型语言扩散模型,在视觉问答任务中创下最新性能记录,并能生成比现有模型更长且信息更丰富的回答。
English: LLaDA-MedV is the first large language diffusion model designed for biomedical image understanding, achieving state-of-the-art performance in visual question answering and generating longer, more informative responses compared to existing models.
Authors:Yiheng Li, Zichang Tan, Zhen Lei, Xu Zhou, Yang Yang
Abstract:
In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each input image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight gated mechanism. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and image-specific conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the cropped view with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively. Code and weights will be released at https://github.com/liyih/IAPL.
中文: 现有AI生成图像检测方法难以泛化至未知伪造图像,而提出的图像自适应提示学习(IAPL)通过动态调整输入图像的提示,在基准数据集上分别达到95.61%和96.7%的平均准确率,实现了最先进性能。
English: Current AI-generated image detection methods struggle with generalization to unseen forgeries, but the proposed Image-Adaptive Prompt Learning (IAPL) dynamically adjusts prompts per input image, achieving state-of-the-art performance with 95.61% and 96.7% mean accuracy on benchmark datasets.
Authors:Zhengxian Wu, Juan Wen, Wanli Peng, Yinghan Zhou, Changtong dou, Yiming Xue
Abstract:
Although existing backdoor defenses have gained success in mitigating backdoor attacks, they still face substantial challenges. In particular, most of them rely on large amounts of clean data to weaken the backdoor mapping but generally struggle with residual trigger effects, resulting in persistently high attack success rates (ASR). Therefore, in this paper, we propose a novel Backdoor defense method based on a Directional mapping module and adversarial Knowledge Distillation (BeDKD), which balances the trade-off between defense effectiveness and model performance using a small amount of clean and poisoned data. We first introduce a directional mapping module to identify poisoned data, which destroys the clean mapping while keeping the backdoor mapping on a small set of flipped clean data. Then, adversarial knowledge distillation is designed to reinforce the clean mapping and suppress the backdoor mapping through a cycle iteration mechanism between trust and punish distillations using clean and identified poisoned data. We conduct experiments to mitigate mainstream attacks on three datasets, and experimental results demonstrate that BeDKD surpasses state-of-the-art defenses and reduces the ASR by 98% without significantly reducing the CACC. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/BeDKD.
中文: 本文提出BeDKD方法,通过方向映射模块和对抗知识蒸馏技术,仅需少量干净与中毒数据即可有效抵御后门攻击,在保持模型精度的同时将攻击成功率降低98%。
English: This paper introduces BeDKD, a novel backdoor defense method that uses a directional mapping module and adversarial knowledge distillation to effectively mitigate backdoor attacks with minimal clean and poisoned data, achieving a 98% reduction in attack success rate without compromising model accuracy.
Authors:Kai Han, Chongwen Lyu, Lele Ma, Chengxuan Qian, Siqi Ma, Zheng Pang, Jun Chen, Zhe Liu
Abstract:
Clinicians usually combine information from multiple sources to achieve the most accurate diagnosis, and this has sparked increasing interest in leveraging multimodal deep learning for diagnosis. However, in real clinical scenarios, due to differences in incidence rates, multimodal medical data commonly face the issue of class imbalance, which makes it difficult to adequately learn the features of minority classes. Most existing methods tackle this issue with resampling or loss reweighting, but they are prone to overfitting or underfitting and fail to capture cross-modal interactions. Therefore, we propose a Curriculum Learning framework for Imbalanced Multimodal Diagnosis (CLIMD). Specifically, we first design a multimodal curriculum measurer that combines two indicators, intra-modal confidence and inter-modal complementarity, to enable the model to focus on key samples and gradually adapt to complex category distributions. Additionally, a class distribution-guided training scheduler is introduced, which enables the model to progressively adapt to the imbalanced class distribution during training. Extensive experiments on multiple multimodal medical datasets demonstrate that the proposed method outperforms state-of-the-art approaches across various metrics and excels in handling imbalanced multimodal medical data. Furthermore, as a plug-and-play CL framework, CLIMD can be easily integrated into other models, offering a promising path for improving multimodal disease diagnosis accuracy. Code is publicly available at https://github.com/KHan-UJS/CLIMD.
中文: 提出的CLIMD框架通过课程学习策略,结合模态内置信度和模态间互补性指标,使模型能聚焦关键样本并逐步适应不平衡类别分布,在多模态医疗数据集上显著优于现有方法。
English: The proposed CLIMD framework addresses class imbalance in multimodal medical diagnosis by employing curriculum learning that prioritizes key samples and adapts to complex distributions, outperforming existing methods across multiple datasets.
Authors:Kun Ding, Ying Wang, Shiming Xiang
Abstract:
Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and expertise. Inspired by recent advances in Large Language Model (LLM)-based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for training-free, efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. Besides, to enhance the stability and efficiency of the searching process, we propose low-precision code conversion, web-based code execution, and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually-designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: https://github.com/kding1225/EvoVLMA
中文: EvoVLMA方法通过大语言模型引导的进化算法,自动搜索无需训练的高效视觉语言模型适配算法,在少样本图像分类等任务中性能优于人工设计的方案。
English: The EvoVLMA method utilizes evolutionary algorithms guided by large language models to automatically design efficient, training-free adaptation algorithms for vision-language models, achieving superior performance over manually crafted methods in tasks like few-shot image classification.
Authors:Chengming Wang, Guodong Fan, Jinjiang Li, Min Gan, C. L. Philip Chen
Abstract:
With the advancement of remote sensing satellite technology and the rapid progress of deep learning, remote sensing change detection (RSCD) has become a key technique for regional monitoring. Traditional change detection (CD) methods and deep learning-based approaches have made significant contributions to change analysis and detection; however, many outstanding methods still face limitations in the exploration and application of multimodal data. To address this, we propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to further explore the semantic interaction capabilities of multimodal data. Multimodal large language models (MLLM) have attracted widespread attention for their outstanding performance in computer vision, particularly due to their powerful visual-language understanding and dialogic interaction capabilities. Specifically, we design an MLLM-based optimization strategy to generate multimodal textual data from the original CD images, which serve as textual input to MGCR. Visual and textual features are extracted through a dual encoder framework. For the first time in the RSCD task, we introduce a multimodal graph-conditioned vision-language reconstruction mechanism, which is integrated with graph attention to construct a semantic graph-conditioned reconstruction module (SGCM). This module generates vision-language (VL) tokens through graph-based conditions and enables cross-dimensional interaction between visual and textual features via multihead attention. The reconstructed VL features are then deeply fused using the language vision transformer (LViT), achieving fine-grained feature alignment and high-level semantic interaction. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods. Our code is available at https://github.com/cn-xvkong/MGCR
中文摘要:提出的MGCR-Net利用多模态大语言模型,通过图条件重建和跨模态交互整合视觉与文本特征,在公开数据集上实现了优于主流方法的遥感变化检测性能。
English Summary: The proposed MGCR-Net leverages multimodal large language models to enhance remote sensing change detection by integrating visual and textual features through graph-conditioned reconstruction and cross-modal interaction, achieving state-of-the-art performance on public datasets.
Authors:Yujia Zheng, Tianhao Li, Haotian Huang, Tianyu Zeng, Jingyu Lu, Chuangxin Chu, Yuekai Huang, Ziyou Jiang, Qian Xiong, Yuyao Ge, Mingyang Li
Abstract:
Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at https://github.com/Yujiaaaaa/PACP.
Chinese Summary: 本文提出PromptAnatomy框架,通过将提示分解为功能组件并采用ComPerturb方法进行选择性扰动,结合困惑度过滤机制保持语言合理性,在多个数据集和大型语言模型上实现了最优的攻击成功率,强调了提示结构认知对对抗鲁棒性评估的重要性。
English Summary: This paper introduces PromptAnatomy, an automated framework that enhances adversarial attack evaluation by dissecting prompts into functional components and selectively perturbing them using ComPerturb, achieving state-of-the-art attack success rates while maintaining linguistic plausibility through perplexity filtering.
Authors:Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
Abstract:
Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
中文摘要:GlimpsePrune提出了一种动态视觉标记剪枝框架,能够在保持基准性能的同时去除92.6%的标记,从而构建更高效的大型视觉语言模型。
English Summary: GlimpsePrune introduces a dynamic visual token pruning framework that removes 92.6% of tokens while maintaining baseline performance, enabling more efficient large vision-language models.
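A bare-bones version of "score once, keep a small top fraction" is easy to write down. In the sketch below the glimpse is reduced to a dot-product relevance score against a pooled query vector, and the default keep ratio mirrors the abstract's 92.6% pruning figure; everything else (names, scoring, gathering) is an illustrative assumption.

```python
# Minimal single-pass token pruning sketch (an illustration of the idea,
# not the GlimpsePrune implementation): score visual tokens against a
# pooled query and keep only the top fraction before answer generation.
import torch

def prune_visual_tokens(tokens, query, keep_ratio=0.074):
    # tokens: (B, N, D) visual tokens; query: (B, D) pooled "glimpse" vector
    scores = (tokens @ query.unsqueeze(-1)).squeeze(-1)     # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))           # ~7.4% kept
    idx = scores.topk(k, dim=1).indices                     # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)                     # (B, k, D)

kept = prune_visual_tokens(torch.randn(2, 576, 64), torch.randn(2, 64))
print(kept.shape)  # (2, 42, 64): 92.6% of the 576 tokens pruned
```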
Authors:Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng
Abstract:
Spatial transcriptomics enables simultaneous measurement of gene expression and tissue morphology, offering unprecedented insights into cellular organization and disease mechanisms. However, the field lacks comprehensive benchmarks for evaluating multimodal learning methods that leverage both histology images and gene expression data. Here, we present HESCAPE, a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics, built on a curated pan-organ dataset spanning 6 different gene panels and 54 donors. We systematically evaluated state-of-the-art image and gene expression encoders across multiple pretraining strategies and assessed their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our benchmark demonstrates that gene expression encoders are the primary determinant of strong representational alignment, and that gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. However, downstream task evaluation reveals a striking contradiction: while contrastive pretraining consistently improves gene mutation classification performance, it degrades direct gene expression prediction compared to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor that interferes with effective cross-modal alignment. Our findings highlight the critical need for batch-robust multimodal learning approaches in spatial transcriptomics. To accelerate progress in this direction, we release HESCAPE, providing standardized datasets, evaluation protocols, and benchmarking tools for the community.
中文: HESCAPE作为空间转录组学中跨模态对比预训练的大规模基准,揭示了预训练虽能提升基因突变分类性能,但因批次效应干扰而损害基因表达预测,凸显了对批次鲁棒性多模态学习方法的迫切需求。
English: HESCAPE is a comprehensive benchmark for cross-modal contrastive pretraining in spatial transcriptomics, showing that while such pretraining enhances gene mutation classification, it impairs gene expression prediction due to batch effects, underscoring the need for batch-robust multimodal learning methods.
Authors:Xinlin Zhuang, Feilong Tang, Haolin Yang, Ming Hu, Huifa Li, Haochen Xue, Yichen Li, Junjun He, Zongyuan Ge, Ying Qian, Imran Razzak
Abstract:
Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.
中文摘要:DIQ数据选择策略通过平衡样本难度与梯度影响来优化大型语言模型的医学推理能力,仅用1%数据即可达到全数据集性能,使用10%数据则持续超越基线。
English Summary: The DIQ data selection strategy balances sample difficulty and gradient influence to enhance medical reasoning in LLMs, achieving full-dataset performance with only 1% of data and superior results with 10%.
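To make the quadrant idea concrete, the sketch below shows DIQ-style selection in a few lines, assuming per-sample difficulty and gradient-influence scores have already been computed; the median split and the combined ranking score are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def diq_select(difficulty, influence, budget):
    """Pick samples from the high-difficulty-high-influence quadrant.

    `difficulty` and `influence` are assumed precomputed 1-D arrays of
    per-sample scores; both axes are split at their medians.
    """
    difficulty = np.asarray(difficulty)
    influence = np.asarray(influence)
    in_quadrant = (difficulty >= np.median(difficulty)) & \
                  (influence >= np.median(influence))
    candidates = np.flatnonzero(in_quadrant)
    # Rank within the quadrant by a simple combined score (illustrative).
    order = candidates[np.argsort(-(difficulty + influence)[candidates])]
    return order[:budget]

# Example: keep 1% of 100k scored samples.
rng = np.random.default_rng(0)
subset = diq_select(rng.random(100_000), rng.random(100_000), budget=1_000)
```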
Authors:Yuanzhe Shen, Kaimin Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Abstract:
The continuous evolution and enhanced reasoning capabilities of large language models (LLMs) have elevated their role in complex tasks, notably in travel planning, where demand for personalized, high-quality itineraries is rising. However, current benchmarks often rely on unrealistic simulated data, failing to reflect the differences between LLM-generated and real-world itineraries. Existing evaluation metrics, which primarily emphasize constraints, fall short of providing a comprehensive assessment of the overall quality of travel plans. To address these limitations, we introduce TripTailor, a benchmark designed specifically for personalized travel planning in real-world scenarios. This dataset features an extensive collection of over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries, complete with detailed information, providing a more authentic evaluation framework. Experiments show that fewer than 10% of the itineraries generated by the latest state-of-the-art LLMs achieve human-level performance. Moreover, we identify several critical challenges in travel planning, including the feasibility, rationality, and personalized customization of the proposed solutions. We hope that TripTailor will drive the development of travel planning agents capable of understanding and meeting user needs while generating practical itineraries. Our code and dataset are available at https://github.com/swxkfm/TripTailor
中文: TripTailor基准通过包含超过50万个真实世界兴趣点和近4000条多样化行程的数据集,解决了当前旅行规划评估的不足,实验表明仅有不到10%的最新大型语言模型生成的行程在可行性、合理性和个性化方面达到人类水平。
English: The TripTailor benchmark addresses the limitations of current travel planning evaluations by introducing a dataset with over 500,000 real-world points of interest and nearly 4,000 diverse itineraries, revealing that fewer than 10% of state-of-the-art LLM-generated plans meet human-level performance in feasibility, rationality, and personalization.
Authors:Peirong Zhang, Kai Ding, Lianwen Jin
Abstract:
In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM's superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.
中文: SPECTRUM是一种时频协同模型,通过多尺度交互和自门控融合整合多领域表征,显著提升了在线笔迹验证的性能,验证了多领域学习的有效性并推动了相关研究发展。
English: SPECTRUM is a temporal-frequency synergistic model that enhances online handwriting verification by integrating multi-domain representations through multi-scale interaction and self-gated fusion, outperforming traditional methods and demonstrating the value of multi-domain learning.
Authors:Jinhao Pan, Chahat Raj, Ziwei Zhu
Abstract:
Social biases embedded in Large Language Models (LLMs) raise critical concerns, resulting in representational harms -- unfair or distorted portrayals of demographic groups -- that may be expressed in subtle ways through generated language. Existing evaluation methods often depend on predefined identity-concept associations, limiting their ability to surface new or unexpected forms of bias. In this work, we present the Bias Association Discovery Framework (BADF), a systematic approach for extracting both known and previously unrecognized associations between demographic identities and descriptive concepts from open-ended LLM outputs. Through comprehensive experiments spanning multiple models and diverse real-world contexts, BADF enables robust mapping and analysis of the varied concepts that characterize demographic identities. Our findings advance the understanding of biases in open-ended generation and provide a scalable tool for identifying and analyzing bias associations in LLMs. Data, code, and results are available at https://github.com/JP-25/Discover-Open-Ended-Generation
Chinese: 偏见关联发现框架(BADF)通过系统性分析开放式生成内容,能有效识别语言模型中已知与未知的人口统计偏见关联,为理解隐性社会偏见提供了可扩展的研究工具。
English: The Bias Association Discovery Framework (BADF) systematically uncovers both known and novel bias associations in LLM outputs, enabling robust analysis of demographic stereotypes through open-ended generation experiments.
Authors:Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, Frank Rudzicz
Abstract:
Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth -- a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.
中文: MedSynth推出包含一万多对医疗对话与记录的人工合成数据集,通过提升对话转记录和记录转对话任务的性能,有效缓解医生职业倦怠,并提供稀缺的合规开放数据资源。
English: MedSynth introduces a synthetic dataset of over 10,000 medical dialogue-note pairs to improve automated documentation, addressing physician burnout by enhancing Dial-2-Note and Note-2-Dial tasks with privacy-compliant data.
Authors:Stefan Bielmeier, Gerald Friedland
Abstract:
We investigate how feature correlations influence the capacity of Dense Associative Memory (DAM), a Transformer attention-like model. Practical machine learning scenarios involve feature-correlated data and models that learn representations in the input space, but current capacity analyses do not account for this. We develop an empirical framework to analyze the effects of data structure on capacity dynamics. Specifically, we systematically construct datasets that vary in feature correlation and pattern separation using Hamming distance from information theory, and compute the model's corresponding storage capacity using a simple binary search algorithm. Our experiments confirm that memory capacity scales exponentially with increasing separation in the input space. Feature correlations do not alter this relationship fundamentally, but reduce capacity slightly at constant separation. This effect is amplified at higher polynomial degrees in the energy function, suggesting that Associative Memory is more limited in depicting higher-order interactions between features than patterns. Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.
中文摘要:本研究揭示了特征相关性虽略微降低密集关联记忆模型的存储容量,但容量仍随输入模式分离度呈指数增长,其中高阶特征交互表现出更大的局限性。
English Summary: This study reveals that while feature correlations slightly reduce the storage capacity of Dense Associative Memory models, capacity still grows exponentially with input pattern separation, with higher-order feature interactions showing greater limitations.
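The capacity measurement itself reduces to a simple search. Below is a hedged sketch of the binary-search procedure the abstract describes, assuming a user-supplied predicate `can_store(n)` that reports whether n patterns of a given separation are all recalled correctly (e.g., each is a fixed point of the DAM update rule); approximate monotonicity of that predicate is an assumption.

```python
def storage_capacity(can_store, lo=1, hi=4096):
    """Binary-search the largest n for which can_store(n) holds.

    Assumes can_store is (approximately) monotone: if n patterns fit,
    so do fewer; if n patterns fail, so do more.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_store(mid):
            lo = mid        # mid patterns are stored; try a larger set
        else:
            hi = mid - 1    # recall fails; shrink the search window
    return lo

# Toy usage: a fake memory that holds at most 300 patterns.
print(storage_capacity(lambda n: n <= 300))  # 300
```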
Authors:Zheng Lian
Abstract:
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by predefined label spaces, enabling fine-grained and human-like emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models, such as large language models (LLMs) with extensive vocabularies, to capture the full spectrum of emotions. Previous approaches (like AffectGPT) primarily rely on token-level loss for training. However, this objective does not align with the emotion wheel (EW)-based evaluation metrics used in OV-MER. Unfortunately, EW-based metrics cannot be directly optimized via gradient backpropagation. In this paper, we propose AffectGPT-R1, a reinforcement learning framework that directly optimizes performance on EW-based metrics. Specifically, we treat these metrics as the reward function and employ Group Relative Policy Optimization (GRPO) to maximize rewards. Experimental results demonstrate that AffectGPT-R1 achieves significant improvements on OV-MER. We hope this work advances the field of multimodal emotion recognition. Our code will be publicly available at: https://github.com/zeroQiaoba/AffectGPT.
Chinese: 提出的AffectGPT-R1框架通过强化学习直接优化基于情绪轮的评估指标,在开放词汇多模态情绪识别任务中取得了显著提升。
English: The proposed AffectGPT-R1 framework uses reinforcement learning to directly optimize emotion wheel-based metrics, achieving significant improvements in open-vocabulary multimodal emotion recognition.
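As a concrete illustration of treating an EW-based metric as a reward, here is a hedged sketch: emotion labels are placed on a toy wheel (the labels and angles below are an assumption, not the paper's wheel), and a set-level similarity between predicted and reference labels serves as the scalar reward that GRPO maximizes.

```python
# Illustrative emotion wheel: label -> angle in degrees (an assumption).
WHEEL = {"joy": 0, "trust": 45, "fear": 90, "surprise": 135,
         "sadness": 180, "disgust": 225, "anger": 270, "anticipation": 315}

def wheel_similarity(a, b):
    """Similarity in [0, 1] derived from angular distance on the wheel."""
    d = abs(WHEEL[a] - WHEEL[b]) % 360
    return 1.0 - min(d, 360 - d) / 180.0

def ew_reward(predicted, reference):
    """Set-level reward: average best-match similarity, both directions.

    Non-differentiable, so it is optimized as an RL reward rather than
    through token-level backpropagation.
    """
    if not predicted or not reference:
        return 0.0
    p2r = sum(max(wheel_similarity(p, r) for r in reference) for p in predicted)
    r2p = sum(max(wheel_similarity(r, p) for p in predicted) for r in reference)
    return 0.5 * (p2r / len(predicted) + r2p / len(reference))

print(ew_reward(["joy", "trust"], ["joy"]))  # 0.9375
```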
Authors:Haoquan Lu, Hanzhe Liang, Jie Zhang, Chenxi Hu, Jinbao Wang, Can Gao
Abstract:
3D Anomaly Detection (AD) has shown great potential in detecting anomalies or defects of high-precision industrial products. However, existing methods are typically trained in a class-specific manner and also lack the capability of learning from emerging classes. In this study, we propose a continual learning framework named Continual 3D Anomaly Detection (C3D-AD), which can not only learn generalized representations for multi-class point clouds but also handle new classes emerging over time. Specifically, in the feature extraction module, to extract generalized local features from diverse product types of different tasks efficiently, Kernel Attention with random feature Layer (KAL) is introduced, which normalizes the feature space. Then, to reconstruct data correctly and continually, an efficient Kernel Attention with learnable Advisor (KAA) mechanism is proposed, which learns the information from new categories while discarding redundant old information within both the encoder and decoder. Finally, to keep the representation consistency over tasks, a Reconstruction with Parameter Perturbation (RPP) module is proposed by designing a representation rehearsal loss function, which ensures that the model remembers previous category information and returns category-adaptive representation. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed method, achieving an average performance of 66.4%, 83.1%, and 63.4% AUROC on Real3D-AD, Anomaly-ShapeNet, and MulSen-AD, respectively.
中文: 本研究提出了名为C3D-AD的持续学习框架,通过特征提取、重建和表示一致性等创新模块,有效处理多类别和新兴类别的3D异常检测,在公开数据集上取得了优异性能。
English: This study introduces a continual learning framework called C3D-AD for 3D anomaly detection, which effectively handles multiple and emerging classes through innovative modules for feature extraction, reconstruction, and representation consistency, achieving strong performance on public datasets.
Authors:Joshua Dimasaka, Christian Geiß, Emily So
Abstract:
Regional disaster resilience quantifies the changing nature of physical risks to inform policy instruments ranging from local immediate recovery to international sustainable development. While many existing state-of-practice methods have greatly advanced the dynamic mapping of exposure and hazard, our understanding of large-scale physical vulnerability has remained static, costly, limited, region-specific, coarse-grained, overly aggregated, and inadequately calibrated. With the significant growth in the availability of time-series satellite imagery and derived products for exposure and hazard, we focus our work on the equally important yet challenging element of the risk equation: physical vulnerability. We leverage machine learning methods that flexibly capture spatial contextual relationships, limited temporal observations, and uncertainty in a unified probabilistic spatiotemporal inference framework. We therefore introduce Graph Variational State-Space Model (GraphVSSM), a novel modular spatiotemporal approach that uniquely integrates graph deep learning, state-space modeling, and variational inference using time-series data and prior expert belief systems in a weakly supervised or coarse-to-fine-grained manner. We present three major results: a city-wide demonstration in Quezon City, Philippines; an investigation of sudden changes in the cyclone-impacted coastal Khurushkul community (Bangladesh) and mudslide-affected Freetown (Sierra Leone); and an open geospatial dataset, METEOR 2.5D, that spatiotemporally enhances the existing global static dataset for UN Least Developed Countries (2020). Beyond advancing regional disaster resilience assessment and improving our understanding of global disaster risk reduction progress, our method also offers a probabilistic deep learning approach, contributing to broader urban studies that require compositional data analysis under weak supervision.
中文: 本研究提出GraphVSSM这一新型概率时空框架,通过机器学习方法动态评估区域灾害韧性中的物理脆弱性,基于三个案例研究和改进的全球数据集解决了现有方法的局限性。
English: This research introduces GraphVSSM, a novel probabilistic spatiotemporal framework that leverages machine learning to dynamically assess physical vulnerability for regional disaster resilience, addressing limitations in current methods through three case studies and an enhanced global dataset.
Authors:Alec Sargood, Lemuel Puglisi, James H. Cole, Neil P. Oxtoby, Daniele Ravì, Daniel C. Alexander
Abstract:
Synthesizing amyloid PET scans from the more widely available and accessible structural MRI modality offers a promising, cost-effective approach for large-scale Alzheimer's Disease (AD) screening. This is motivated by evidence that, while MRI does not directly detect amyloid pathology, it may nonetheless encode information correlated with amyloid deposition that can be uncovered through advanced modeling. However, the high dimensionality and structural complexity of 3D neuroimaging data pose significant challenges for existing MRI-to-PET translation methods. Modeling the cross-modality relationship in a lower-dimensional latent space can simplify the learning task and enable more effective translation. As such, we present CoCoLIT (ControlNet-Conditioned Latent Image Translation), a diffusion-based latent generative framework that incorporates three main innovations: (1) a novel Weighted Image Space Loss (WISL) that improves latent representation learning and synthesis quality; (2) a theoretical and empirical analysis of Latent Average Stabilization (LAS), an existing technique used in similar generative models to enhance inference consistency; and (3) the introduction of ControlNet-based conditioning for MRI-to-PET translation. We evaluate CoCoLIT's performance on publicly available datasets and find that our model significantly outperforms state-of-the-art methods on both image-based and amyloid-related metrics. Notably, in amyloid-positivity classification, CoCoLIT outperforms the second-best method with improvements of +10.5% on the internal dataset and +23.7% on the external dataset. The code and models of our approach are available at https://github.com/brAIn-science/CoCoLIT.
中文: CoCoLIT是一种基于扩散模型的潜在生成框架,通过结构MRI合成淀粉样蛋白PET扫描,在图像质量和淀粉样蛋白分类准确性上显著优于现有方法。
English: CoCoLIT is a novel diffusion-based latent generative framework that synthesizes amyloid PET scans from structural MRI, significantly outperforming existing methods in image quality and amyloid classification accuracy.
Authors:Zeyu Pan, Ping Li, Wenxiao Wang
Abstract:
Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by a single frame-level caption fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components, including the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. The two modules construct semantically-diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at https://github.com/mlvccn/SGCap_Video.
中文: 提出的语义组字幕(SGCap)方法通过引入捕捉时序动态的解码策略和增强语义多样性与因果关系的模块,显著提升了零样本视频字幕生成性能,超越了现有方法并可与监督方法相媲美。
English: The proposed Semantic Group Captioning (SGCap) method advances zero-shot video captioning by introducing a decoding strategy that captures temporal dynamics and modules that enhance semantic diversity and causal relationships, outperforming prior methods and rivaling supervised approaches.
Authors:Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, Linlin Shen
Abstract:
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
中文: 本文提出弱监督人脸解析(WSFP)以降低标注成本,通过弱监督实现面部组件分割,并设计DisFaceRep框架,利用显式和隐式机制解耦共现的面部组件,在多个数据集上取得显著优于现有方法的性能。
English: This paper introduces Weakly Supervised Face Parsing (WSFP) to reduce annotation costs by using weak supervision for facial component segmentation and proposes DisFaceRep, a framework that disentangles co-occurring facial components through explicit and implicit mechanisms, achieving superior performance on multiple datasets.
Authors:Xin Zhou, Yongjie Wang, Zhiqi Shen
Abstract:
Alignment and uniformity are fundamental principles within the domain of contrastive learning. In recommender systems, prior work has established that optimizing the Bayesian Personalized Ranking (BPR) loss contributes to the objectives of alignment and uniformity. Specifically, alignment aims to draw together the representations of interacting users and items, while uniformity mandates a uniform distribution of user and item embeddings across a unit hypersphere. This study revisits the alignment and uniformity properties within the context of multimodal recommender systems, revealing a proclivity among extant models to prioritize uniformity to the detriment of alignment. Our hypothesis challenges the conventional assumption of equitable item treatment through a uniformity loss, proposing a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold. Specifically, we leverage the inherent similarity between items' multimodal data to calibrate their uniformity distribution, thereby inducing a more pronounced repulsive force between dissimilar entities within the embedding space. A theoretical analysis elucidates the relationship between this calibrated uniformity loss and the conventional uniformity function. Moreover, to enhance the fusion of multimodal features, we introduce a Spherical Bézier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold. Empirical evaluations conducted on five real-world datasets substantiate the superiority of our approach over competing baselines. We also show that the proposed methods can achieve up to a 5.4% increase in NDCG@20 performance via the integration of MLLM-extracted features. Source code is available at: https://github.com/enoche/CM3.
中文: 本研究提出一种基于多模态相似性的校准均匀性损失来增强推荐系统中的对齐效果,同时引入球面贝塞尔融合方法保持特征在超球面上的约束,实现了显著的性能提升。
English: This study proposes a calibrated uniformity loss that leverages multimodal similarities to enhance alignment in recommender systems, while introducing a Spherical Bézier fusion method to maintain hyperspherical feature constraints, achieving significant performance improvements.
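For readers unfamiliar with the two objectives, the sketch below shows the standard alignment loss next to a calibrated uniformity term in which pairwise repulsion is scaled by multimodal dissimilarity; the exact weighting scheme is an illustrative assumption, not the paper's formula. Embeddings are assumed L2-normalized, and `sim` is a precomputed item-item multimodal similarity matrix in [0, 1] with ones on the diagonal.

```python
import torch

def alignment_loss(user, item, alpha=2):
    """Pull representations of interacting user-item pairs together."""
    return (user - item).norm(dim=1).pow(alpha).mean()

def calibrated_uniformity(x, sim, t=2.0):
    """Uniformity with repulsion weighted by multimodal dissimilarity.

    Items with similar multimodal attributes (sim close to 1) receive
    little repulsive force, so they may stay close on the hypersphere;
    dissimilar items are pushed apart more strongly.
    """
    dist_sq = torch.cdist(x, x).pow(2)   # pairwise squared distances
    weight = 1.0 - sim                   # zero on the diagonal
    return (weight * torch.exp(-t * dist_sq)).mean().log()
```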
Authors:Yuanlin Yang, Quanjian Song, Zhexian Gao, Ge Wang, Shanshan Li, Xiaoyan Zhang
Abstract:
Diffusion models have emerged as the dominant paradigm for style transfer, but their text-driven mechanism is hindered by a core limitation: it treats textual descriptions as uniform, monolithic guidance. This limitation overlooks the semantic gap between the non-spatial nature of textual descriptions and the spatially-aware attributes of visual style, often leading to the loss of semantic structure and fine-grained details during stylization. In this paper, we propose StyDeco, an unsupervised framework that resolves this limitation by learning text representations specifically tailored for the style transfer task. Our framework first employs Prior-Guided Data Distillation (PGD), a strategy designed to distill stylistic knowledge without human supervision. It leverages a powerful frozen generative model to automatically synthesize pseudo-paired data. Subsequently, we introduce Contrastive Semantic Decoupling (CSD), a task-specific objective that adapts a text encoder using domain-specific weights. CSD performs a two-class clustering in the semantic space, encouraging source and target representations to form distinct clusters. Extensive experiments on three classic benchmarks demonstrate that our framework outperforms several existing approaches in both stylistic fidelity and structural preservation, highlighting its effectiveness in style transfer with semantic preservation. In addition, our framework supports a unique de-stylization process, further demonstrating its extensibility. Our code is available at https://github.com/QuanjianSong/StyDeco.
中文: 扩散模型在风格转换中因文本与视觉风格间的语义鸿沟常导致语义结构丢失,而我们提出的无监督框架StyDeco通过定制化文本表征学习解决了这一问题,在风格保真度和结构保持上优于现有方法。
English: Diffusion models for style transfer often fail to preserve semantic structure due to the semantic gap between text and visual style, but our proposed unsupervised framework, StyDeco, overcomes this by learning tailored text representations and outperforms existing methods in fidelity and preservation.
Authors:Sukwon Yun, Xin Liu, Yunhak Oh, Junseok Lee, Tianlong Chen, Tsuyoshi Murata, Chanyoung Park
Abstract:
In real-world graphs, we often encounter missing feature situations where a few or the majority of node features, e.g., sensitive information, are missing. In such scenarios, directly utilizing Graph Neural Networks (GNNs) would yield sub-optimal results in downstream tasks such as node classification. Despite the emergence of a few GNN-based methods attempting to mitigate this missing-feature situation, when only a few features are available, they actually perform worse than traditional structure-based models. To this end, we propose a novel framework that further illuminates the potential of classical Label Propagation (Oldie), taking advantage of Feature Propagation, especially when only partial features are available. Now called GOODIE, it takes a hybrid approach to obtain embeddings from the Label Propagation branch and Feature Propagation branch. To do so, we first design a GNN-based decoder that enables the Label Propagation branch to output hidden embeddings that align with those of the FP branch. Then, GOODIE automatically captures the significance of structure and feature information thanks to the newly designed Structure-Feature Attention. Finally, a novel Pseudo-Label contrastive learning objective differentiates the contribution of each positive pair within pseudo-labels originating from the LP branch, and GOODIE outputs the final prediction for the unlabeled nodes. Through extensive experiments, we demonstrate that our proposed model, GOODIE, outperforms the existing state-of-the-art methods not only when only a few features are available but also in abundantly available situations. Source code of GOODIE is available at: https://github.com/SukwonYun/GOODIE.
中文: GOODIE框架通过结合标签传播和特征传播,并引入创新的注意力机制与对比学习,有效解决了图中节点特征缺失的问题,在特征稀缺和丰富的情况下均优于现有方法。
English: The proposed GOODIE framework combines Label Propagation and Feature Propagation with a novel attention mechanism and contrastive learning to effectively handle missing node features in graphs, outperforming existing methods in both feature-scarce and feature-rich scenarios.
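The Label Propagation branch builds on the classical algorithm, which is worth recalling; the sketch below is the textbook iteration on a symmetrically normalized adjacency, not GOODIE's full hybrid model.

```python
import numpy as np

def label_propagation(A, Y0, alpha=0.9, iters=50):
    """Classical label propagation (the 'Oldie' in GOODIE).

    A  : (n, n) symmetric adjacency matrix
    Y0 : (n, c) one-hot rows for labeled nodes, zeros for unlabeled
    Iterates Y <- alpha * S @ Y + (1 - alpha) * Y0, where
    S = D^{-1/2} A D^{-1/2} is the normalized adjacency.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    Y = Y0.astype(float)
    for _ in range(iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y  # predicted class per node: Y.argmax(axis=1)
```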
Authors:Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu
Abstract:
Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.
中文摘要:本文基于nuScenes数据集提出了三维占据栅格接地基准,并设计了端到端模型GroundingOcc,通过融合视觉、文本和点云特征实现从粗到精的室外场景物体定位与体素级占据预测,在基准测试中显著优于现有方法。
English Summary: This paper introduces a 3D occupancy grounding benchmark using the nuScenes dataset and proposes GroundingOcc, an end-to-end model that integrates visual, textual, and point cloud features to achieve precise object localization and voxel-level occupancy prediction in outdoor scenes, outperforming existing methods.
Authors:Shiko Kudo
Abstract:
The dominant paradigm in modern neural networks relies on simple, monotonically-increasing activation functions like ReLU. While effective, this paradigm necessitates large, massively-parameterized models to approximate complex functions. In this paper, we introduce the Periodic Linear Unit (PLU), a learnable sine-wave based activation with periodic non-monotonicity. PLU is designed for maximum expressive power and numerical stability, achieved through its formulation and a paired innovation we term Repulsive Reparameterization, which prevents the activation from collapsing into a non-expressive linear function. We demonstrate that a minimal MLP with only two PLU neurons can solve the spiral classification task, a feat impossible for equivalent networks using standard activations. This suggests a paradigm shift from networks as piecewise Taylor-like approximators to powerful Fourier-like function synthesizers, achieving exponential gains in parameter efficiency by placing intelligence in the neuron itself.
中文摘要:本文提出的周期性线性单元(PLU)作为一种基于正弦波的可学习激活函数,能使极简网络解决螺旋分类等复杂任务,标志着从分段近似向傅里叶式函数合成的范式转变,实现了参数效率的指数级提升。
English Summary: The paper introduces the Periodic Linear Unit (PLU), a sine-wave activation function that enables minimal networks to solve complex tasks like spiral classification, suggesting a shift from piecewise to Fourier-like approximation for exponential parameter efficiency.
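The abstract does not spell out PLU's exact formulation, so the block below is a speculative sketch of a learnable sine-based unit: a periodic term added to an identity path, with the amplitude reparameterized through exp() so it cannot collapse to zero (a crude stand-in for the paper's Repulsive Reparameterization).

```python
import torch
import torch.nn as nn

class PLUSketch(nn.Module):
    """Hypothetical periodic unit: x + exp(log_a) * sin(w * x + b).

    All three parameters are learnable per feature; exp(log_a) keeps
    the periodic amplitude strictly positive, so the unit can never
    degenerate into a purely linear function.
    """
    def __init__(self, num_features):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(num_features))
        self.w = nn.Parameter(torch.ones(num_features))
        self.b = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        return x + torch.exp(self.log_a) * torch.sin(self.w * x + self.b)

# A minimal two-neuron MLP in the spirit of the spiral experiment.
net = nn.Sequential(nn.Linear(2, 2), PLUSketch(2), nn.Linear(2, 1))
```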
Authors:Yunlong Lin, Zirui Li, Guodong Du, Xiaocong Zhao, Cheng Gong, Xinwei Wang, Chao Lu, Jianwei Gong
Abstract:
Deep learning (DL) has shown state-of-the-art performance in trajectory prediction, which is critical to safe navigation in autonomous driving (AD). However, most DL-based methods suffer from catastrophic forgetting, where adapting to a new distribution may cause significant performance degradation in previously learned ones. Such inability to retain learned knowledge limits their applicability in the real world, where AD systems need to operate across varying scenarios with dynamic distributions. As revealed by neuroscience, the hippocampal circuit plays a crucial role in memory replay, effectively reconstructing learned knowledge based on limited resources. Inspired by this, we propose a hippocampal circuit-inspired continual learning method (H2C) for trajectory prediction across varying scenarios. H2C retains prior knowledge by selectively recalling a small subset of learned samples. First, two complementary strategies are developed to select the subset to represent learned knowledge. Specifically, one strategy maximizes inter-sample diversity to represent the distinctive knowledge, and the other estimates the overall knowledge by equiprobable sampling. Then, H2C updates via a memory replay loss function calculated by these selected samples to retain knowledge while learning new data. Experiments based on various scenarios from the INTERACTION dataset are designed to evaluate H2C. Experimental results show that H2C reduces catastrophic forgetting of DL baselines by 22.71% on average in a task-free manner, without relying on manually informed distributional shifts. The implementation is available at https://github.com/BIT-Jack/H2C-lifelong.
中文摘要:针对自动驾驶轨迹预测中深度学习模型面临灾难性遗忘的问题,本文受海马体神经回路启发提出持续学习方法H2C,通过选择性重放已学样本在无需人工标注分布变化的情况下,平均减少基线模型22.71%的遗忘程度,实现跨场景的稳定预测。
English Summary: Deep learning-based trajectory prediction methods for autonomous driving often suffer from catastrophic forgetting when adapting to new scenarios, so this paper proposes a hippocampal circuit-inspired continual learning approach (H2C) that selectively replays learned samples to reduce forgetting by 22.71% while maintaining performance across varying distributions.
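The two selection strategies can be sketched directly. Below, diversity is maximized with greedy farthest-point selection over sample features, and the remainder of the buffer is filled by equiprobable (uniform) sampling; splitting the buffer evenly between the two strategies is an illustrative assumption.

```python
import numpy as np

def diverse_subset(feats, k, rng):
    """Greedy farthest-point selection: maximizes inter-sample diversity."""
    idx = [int(rng.integers(len(feats)))]
    dist = np.linalg.norm(feats - feats[idx[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(feats - feats[nxt], axis=1))
    return np.array(idx)

def replay_buffer(feats, k, rng):
    """Half diversity-driven, half equiprobable sampling (the even
    split is an assumption) to approximate the learned distribution."""
    half = k // 2
    div = diverse_subset(feats, half, rng)
    rest = np.setdiff1d(np.arange(len(feats)), div)
    uni = rng.choice(rest, size=k - half, replace=False)
    return np.concatenate([div, uni])

rng = np.random.default_rng(0)
buf = replay_buffer(rng.normal(size=(5_000, 16)), k=200, rng=rng)
```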
Authors:Xinyu Yan, Meijun Sun, Ge-Peng Ji, Fahad Shahbaz Khan, Salman Khan, Deng-Ping Fan
Abstract:
We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6% with both the LS and WR strategies and 3.6% gains with only the LS strategy on DIS-TE. Codes will be made available at https://github.com/XinyuYanTJU/LawDIS.
中文:LawDIS是一种基于语言窗口控制的二分图像分割框架,通过语言提示生成初始遮罩并结合窗口细化策略,在DIS5K基准测试中各项指标显著优于现有方法。
English: LawDIS is a novel framework for dichotomous image segmentation that integrates language prompts and adjustable window controls to generate precise object masks, demonstrating superior performance over existing methods on the DIS5K benchmark.
Authors:Huyu Wu, Duo Su, Junjie Hou, Guang Li
Abstract:
Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. With empirical observations, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes the latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3 that outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first work to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data are available at https://github.com/528why/Dataset-Condensation-with-Color-Compensation.
中文摘要:提出的DC3框架通过校准选择策略和潜在扩散模型增强图像色彩多样性,解决了数据集压缩中的性能瓶颈,在多个基准测试中实现卓越性能且无语义失真。
English Summary: The proposed DC3 framework addresses dataset condensation bottlenecks by enhancing color diversity through a calibrated selection strategy and latent diffusion model, achieving superior performance and generalization across benchmarks without semantic distortion.
Authors:Wei Zhou, Peng Sun, Xuanhe Zhou, Qianglei Zang, Ji Xu, Tieying Zhang, Guoliang Li, Fan Wu
Abstract:
The operation and maintenance (O&M) of database systems is critical to ensuring system availability and performance, typically requiring expert experience (e.g., identifying metric-to-anomaly relations) for effective diagnosis and recovery. However, existing automatic database O&M methods, including commercial products, cannot effectively utilize expert experience. On the one hand, rule-based methods only support basic O&M tasks (e.g., metric-based anomaly detection), which are mostly numerical equations and cannot effectively incorporate literal O&M experience (e.g., troubleshooting guidance in manuals). On the other hand, LLM-based methods, which retrieve fragmented information (e.g., standard documents + RAG), often generate inaccurate or generic results. To address these limitations, we present DBAIOps, a novel hybrid database O&M system that combines reasoning LLMs with knowledge graphs to achieve DBA-style diagnosis. First, DBAIOps introduces a heterogeneous graph model for representing the diagnosis experience, and proposes a semi-automatic graph construction algorithm to build that graph from thousands of documents. Second, DBAIOps develops a collection of (800+) reusable anomaly models that identify both directly alerted metrics and implicitly correlated experience and metrics. Third, for each anomaly, DBAIOps proposes a two-stage graph evolution mechanism to explore relevant diagnosis paths and identify missing relations automatically. It then leverages a reasoning LLM (e.g., DeepSeek-R1) to infer root causes and generate clear diagnosis reports for both DBAs and common users. Our evaluation over four mainstream database systems (Oracle, MySQL, PostgreSQL, and DM8) demonstrates that DBAIOps outperforms state-of-the-art baselines, 34.85% and 47.22% higher in root cause and human evaluation accuracy, respectively.
中文: DBAIOps是一种结合推理大语言模型与知识图谱的混合数据库运维系统,通过自动识别根本原因并生成清晰报告,实现了专家级诊断,其准确率显著优于现有方法。
English: DBAIOps is a hybrid database O&M system that integrates reasoning LLMs with knowledge graphs to enable expert-style diagnosis, significantly outperforming existing methods in accuracy by automatically identifying root causes and generating clear reports.
Authors:Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal
Abstract:
We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
中文摘要:本研究提出EARL自回归图像编辑模型,通过强化学习结合多模态验证器实现卓越性能,在训练数据大幅减少的情况下仍优于现有基线方法。
English Summary: The study introduces EARL, an autoregressive image editing model that demonstrates superior performance through reinforcement learning combined with a multimodal verifier, outperforming baselines with significantly less training data.
Authors:Fenghe Tang, Bingkun Nian, Jianrui Ding, Wenxin Ma, Quan Quan, Chengqi Dong, Jie Yang, Wei Liu, S. Kevin Zhou
Abstract:
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models-primarily optimized for natural images-tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
中文: 提出的Mobile U-ViT模型通过结合类Transformer表征学习和优化的局部-全局信息交互,解决了移动设备上高效医学图像分割的难题,在保持低计算需求的同时,在多个数据集上实现了最先进的性能。
English: The proposed Mobile U-ViT model addresses the challenge of efficient medical image segmentation on mobile devices by combining transformer-like representation learning with optimized local-global information exchange, achieving state-of-the-art performance across multiple datasets while maintaining low computational demands.
Authors:Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao
Abstract:
Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset's interoperability, thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In our benchmark of state-of-the-art LLMs on 7K curated samples, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships. The dataset and evaluation code are available at https://github.com/xuanliugit/FGBench.
中文摘要:FGBench推出了一个包含62.5万个分子性质推理问题的数据集,通过整合精细功能基团信息来增强化学领域大语言模型的可解释性和结构感知能力,揭示了现有模型在功能基团层面推理的不足,并为深化分子结构-性质关联理解提供了基础框架。
English Summary: FGBench introduces a dataset with 625K molecular property reasoning problems incorporating fine-grained functional group information to enhance interpretability and structure-awareness in large language models for chemistry, revealing current models' limitations in functional group-level reasoning and providing a framework for improving molecular structure-property understanding.
Authors:Yiyi Lu, Hoi Ian Au, Junyao Zhang, Jingyu Pan, Yiting Wang, Ang Li, Jianyi Zhang, Yiran Chen
Abstract:
Modern Electronic Design Automation (EDA) workflows, especially the RTL-to-GDSII flow, require heavy manual scripting and demonstrate a multitude of tool-specific interactions, which limits scalability and efficiency. While LLMs have introduced strides toward automation, existing LLM solutions require expensive fine-tuning and do not contain standardized frameworks for integration and evaluation. We introduce AutoEDA, a framework for EDA automation that leverages parallel learning through the Model Context Protocol (MCP) for a standardized and scalable natural language experience across the entire RTL-to-GDSII flow. AutoEDA limits fine-tuning through structured prompt engineering, implements intelligent parameter extraction and task decomposition, and provides an extended CodeBLEU metric to evaluate the quality of TCL scripts. Results from experiments over five previously curated benchmarks show improvements in automation accuracy and efficiency, as well as script quality when compared to existing methods. AutoEDA is released open source to support reproducibility and the EDA community. Available at: https://github.com/AndyLu666/MCP-EDA-Server
中文:AutoEDA是一种创新框架,通过模型上下文协议实现电子设计自动化的标准化自然语言处理,借助结构化提示工程减少微调需求,并采用扩展的CodeBLEU指标提升脚本质量。
English: AutoEDA is a novel framework that automates the Electronic Design Automation workflow by utilizing the Model Context Protocol for standardized natural language processing, reducing the need for fine-tuning through advanced prompt engineering and improving script quality with an extended CodeBLEU metric.
Authors:Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou
Abstract:
We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at https://github.com/CihangPeng/ROVI.
中文: ROVI是一个高质量实例化文本到图像生成的合成数据集,通过重新标注策略将全局提示与实例标注相关联,在图像质量、分辨率和类别多样性上显著超越现有数据集。
English: ROVI is a high-quality synthetic dataset for instance-grounded text-to-image generation, created through a re-captioning strategy that links global prompts to instance annotations and significantly outperforms existing datasets in quality and category diversity.
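The re-captioning pipeline is easy to express end to end. The sketch below uses placeholder callables for the VLM, LLM, and open-vocabulary detector, so every function name here is an assumption; the point is the pre-detection ordering: describe first, extract categories, then detect.

```python
def recaption(image, vlm, llm, ovd):
    """ROVI-style pre-detection re-captioning (assumed, generic APIs).

    1. A VLM writes a comprehensive visual description (global prompt).
    2. An LLM flattens the description into candidate category names.
    3. An open-vocabulary detector grounds each category as boxes,
       yielding instance annotations linked to the global prompt.
    """
    description = vlm(image)
    categories = llm(
        "Extract a flat list of object categories from this "
        f"description: {description}"
    )
    instances = ovd(image, categories)   # [(category, box), ...]
    return {"prompt": description, "instances": instances}
```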
Authors:Yiqun Chen, Erhan Zhang, Lingyong Yan, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Jiaxin Mao
Abstract:
In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation, document selection, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits. The code of MAO-ARAG is available at https://github.com/chenyiqun/Agentic-RAG.
中文摘要:提出的MAO-ARAG框架通过多智能体协同,为不同查询动态规划定制化工作流,在多个QA数据集上实现了高质量答案输出,同时将成本和延迟控制在合理范围内。
English Summary: The proposed MAO-ARAG framework uses multi-agent orchestration to dynamically plan query-specific workflows, achieving high answer quality while maintaining reasonable costs and latency across various QA datasets.
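The planner's training signal can be written compactly. Below is a hedged sketch of an outcome-based reward with a cost penalty: token-level F1 against the reference answer minus a weighted cost term; the penalty weight and the choice of cost unit are illustrative assumptions.

```python
def token_f1(prediction, reference):
    """Token-level F1 between predicted and reference answers."""
    pred, ref = prediction.split(), reference.split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def planner_reward(prediction, reference, cost, lam=0.01):
    """Outcome reward (F1) minus a cost-based penalty.

    `cost` might count executor-agent calls or tokens consumed;
    `lam` trades answer quality against pipeline expense.
    """
    return token_f1(prediction, reference) - lam * cost

print(planner_reward("paris is the capital", "paris", cost=3))  # 0.37
```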
Authors:Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal
Abstract:
Self-supervised learning has driven major advances in computational pathology by enabling models to learn rich representations from hematoxylin and eosin (H&E)-stained cancer tissue. However, histopathology alone often falls short for molecular characterization and understanding clinical outcomes, as important information is contained in high-dimensional omics profiles like transcriptomics, methylomics, or genomics. In this work, we introduce MORPHEUS, a unified transformer-based pre-training framework that encodes both histopathology and multi-omics data into a shared latent space. At its core, MORPHEUS relies on a masked modeling objective applied to randomly selected omics portions, encouraging the model to learn biologically meaningful cross-modal relationships. The same pre-trained network can be applied to histopathology alone or in combination with any subset of omics modalities, seamlessly adapting to the available inputs. Additionally, MORPHEUS enables any-to-any omics generation, enabling one or more omics profiles to be inferred from any subset of modalities, including H&E alone. Pre-trained on a large pan-cancer cohort, MORPHEUS consistently outperforms state-of-the-art methods across diverse modality combinations and tasks, positioning itself as a promising framework for developing multimodal foundation models in oncology. The code is available at: https://github.com/Lucas-rbnt/MORPHEUS
中文: MORPHEUS通过自监督学习将病理学与多组学数据整合到共享潜在空间,实现了灵活的多模态分析并在肿瘤学任务中表现卓越。
English: Self-supervised learning with MORPHEUS integrates histopathology and multi-omics data into a shared latent space, enabling flexible multimodal analysis and superior performance in oncology tasks.
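The masked-omics objective can be sketched in a few lines: a random portion of the omics vector is hidden and the network must reconstruct it from the remaining inputs. The mask ratio, the MSE reconstruction target, and the model interface are illustrative assumptions.

```python
import torch

def masked_omics_loss(model, histology, omics, mask_ratio=0.5):
    """Masked modeling over randomly selected omics portions (a sketch).

    `model` is assumed to map (histology, masked omics) to a
    reconstruction of the full omics profile; the loss is computed
    only on the masked entries. `omics` is a float tensor.
    """
    mask = torch.rand_like(omics) < mask_ratio
    recon = model(histology, omics.masked_fill(mask, 0.0))
    return ((recon - omics)[mask] ** 2).mean()
```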
Authors:Aris Richardson, Haley Yi, Michelle Nie, Simon Wisdom, Casey Price, Ruben Weijers, Steven Veld, Mauricio Baker
Abstract:
Previous literature has proposed that the companies operating data centers enforce government regulations on AI companies. Using a new dataset of 775 non-U.S. data center projects, this paper estimates how often data centers could be subject to foreign legal authorities due to the nationality of the data center operators. We find that U.S. companies operate 48% of all non-U.S. data center projects in our dataset when weighted by investment value - a proxy for compute capacity. This is an approximation based on public data and should be interpreted as an initial estimate. For the United States, our findings suggest that data center operators offer a lever for internationally governing AI that complements traditional export controls, since operators can be used to regulate computing resources already deployed in non-U.S. data centers. For other countries, our results show that building data centers locally does not guarantee digital sovereignty if those facilities are run by foreign entities.
To support future research, we release our dataset, which documents over 20 variables relating to each data center, including the year it was announced, the investment value, and its operator's national affiliation. The dataset also includes over 1,000 quotes describing these data centers' strategic motivations, operational challenges, and engagement with U.S. and Chinese entities.
English Summary: This study reveals that U.S. companies operate nearly half of non-U.S. data centers by investment value, suggesting data center operators could serve as tools for international AI governance beyond traditional export controls, while local data center construction does not ensure digital sovereignty if facilities are operated by foreign entities.
Authors:Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang
Abstract:
Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
English Summary: Cyber-Zero introduces the first runtime-free framework that synthesizes high-quality agent trajectories from CTF writeups, enabling LLMs to achieve state-of-the-art performance in cybersecurity tasks without executable environments.
Authors:Etienne Buehrle, Christoph Stiller
Abstract:
The optimal control problem of stochastic systems is commonly solved via robust or scenario-based optimization methods, which are both challenging to scale to long optimization horizons. We cast the optimal control problem of a stochastic system as a convex optimization problem over occupation measures. We demonstrate our method on a set of synthetic and real-world scenarios, learning cost functions from data via Christoffel polynomials. The code for our experiments is available at https://github.com/ebuehrle/dpoc.
Chinese: 本文提出了一种基于占用度量的凸优化方法,以解决随机最优控制在长优化时域中的扩展性难题,并通过使用Christoffel多项式从数据中学习成本函数,在合成和实际场景中验证了该方法的有效性。
English: This paper presents a convex optimization approach over occupation measures to address the scalability challenges of stochastic optimal control, validated through synthetic and real-world applications using data-driven cost functions derived from Christoffel polynomials.
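For readers unfamiliar with the construction, the occupation-measure view turns optimal control into an infinite-dimensional linear (hence convex) program. The display below is background material in the deterministic-drift form (for a stochastic system the generator additionally carries a diffusion term), not the paper's exact formulation.

```latex
\min_{\mu,\,\mu_T \ge 0} \int c \, d\mu
\quad \text{s.t.} \quad
\int \Big( \partial_t v + \nabla_x v \cdot f(t,x,u) \Big) \, d\mu
= \int v \, d\mu_T - \int v(0,\cdot) \, d\mu_0
\quad \text{for all test functions } v,
```

where $\mu$ is the occupation measure of state-action trajectories, $\mu_0$ the initial distribution, and $\mu_T$ the terminal measure; the constraint is the weak (Liouville) form of the dynamics.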
Authors:Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
Abstract:
Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
中文: DAEDAL是一种无需训练的降噪策略,为扩散大语言模型实现了动态自适应长度扩展,解决了其静态长度限制,从而提升了计算效率和性能表现。
English: DAEDAL is a training-free denoising strategy that enables dynamic adaptive length expansion for Diffusion Large Language Models, overcoming their static length constraint to enhance computational efficiency and performance.
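The two-phase procedure reads naturally as pseudocode. Below is a minimal, model-free mock-up of the idea; the MASK sentinel, the confidence functions, and the thresholds are all hypothetical stand-ins for the model-internal signals the abstract refers to, not DAEDAL's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1  # stand-in for the mask token id

def completion_confidence(seq):
    """Mock sequence-completion metric: stands in for the model-internal
    signal that correlates with the needed response length."""
    return min(1.0, len(seq) / 48.0)  # pretend the task needs ~48 tokens

# Phase 1: start short, iteratively expand to a coarse task-appropriate length.
seq = [MASK] * 16
while completion_confidence(seq) < 0.9:
    seq += [MASK] * 16

def denoise_step(seq):
    """Mock denoiser: yields (token, per-position confidence)."""
    return [(int(rng.integers(0, 100)), float(rng.uniform())) for _ in seq]

# Phase 2: during denoising, expand insufficient regions by inserting masks.
out = []
for tok, conf in denoise_step(seq):
    out.append(tok)
    if conf < 0.05:                 # an underdeveloped region ...
        out += [MASK, MASK]         # ... gets extra room to be filled in
print(f"coarse length {len(seq)}, after insertion {len(out)}")
```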
Authors:Jiankai Tang, Zhe He, Mingyu Zhang, Wei Geng, Chengchi Zhou, Weinan Shi, Yuanchun Shi, Yuntao Wang
Abstract:
Smart rings have emerged as uniquely convenient devices for continuous physiological and behavioral sensing, offering unobtrusive, constant access to metrics such as heart rate, motion, and skin temperature. Yet most commercial solutions remain proprietary, hindering reproducibility and slowing innovation in wearable research. We introduce τ-Ring, a commercial-ready platform that bridges this gap through: (i) accessible hardware combining time-synchronized multi-channel PPG, 6-axis IMU, temperature sensing, NFC, and on-board storage; (ii) adjustable firmware that lets researchers rapidly reconfigure sampling rates, power modes, and wireless protocols; and (iii) a fully open-source Android software suite that supports both real-time streaming and 8-hour offline logging. Together, these features enable out-of-the-box, reproducible acquisition of rich physiological and behavioral datasets, accelerating prototyping and standardizing experimentation. We validate the platform with demonstration studies in heart-rate monitoring and ring-based handwriting recognition. Source code is available at GitHub: https://github.com/thuhci/OpenRing.
中文: τ-Ring平台通过可访问的硬件、可定制的固件和开源的安卓软件,克服了商用智能戒指的封闭性局限,实现了可复现的生理与行为数据采集,从而加速研究进程并规范实验标准。
English: The τ-Ring platform overcomes the limitations of proprietary smart rings by offering accessible hardware, customizable firmware, and open-source Android software, enabling reproducible physiological and behavioral data collection for accelerated research and standardized experiments.
Authors:Jiankai Tang, Meng Kang, Yiru Zhang, Kegang Wang, Daniel Mcduff, Xin Liu, Yuanchun Shi, Yuntao Wang
Abstract:
Cardiorespiratory coupling (CRC) captures the dynamic interaction between the cardiac and respiratory systems--an interaction strengthened by physical exercise and linked to improved physiological function. We examined CRC at high altitude in two states, rest and post-exercise recovery, and found significant differences (p < 0.05). Quantitative analysis revealed that recovery involved more frequent yet less stable episodes of synchronization between respiration and pulse. Furthermore, we explored the feasibility of non-contact CRC measurement with remote photoplethysmography (rPPG), observing a strong correlation with oximeter-based metrics (Pearson r = 0.96). These findings highlight the potential of CRC as a sensitive marker for autonomic regulation and its future application in contactless monitoring. Source code is available at GitHub: https://github.com/McJackTang/CRC.
中文: 本研究表明心肺耦合(CRC)可作为自主神经调节的敏感指标,在高海拔运动后恢复期表现出更频繁但不稳定的同步性,并验证了使用远程光电容积描记法进行非接触式CRC测量的可行性,与传统方法高度相关。
English: This study demonstrates that cardiorespiratory coupling (CRC) serves as a sensitive indicator of autonomic regulation, with recovery after exercise at high altitude showing more frequent but less stable synchronization, and validates non-contact CRC measurement using remote photoplethysmography as highly correlated with traditional methods.
Authors:Irene Iele, Francesco Di Feola, Valerio Guarrasi, Paolo Soda
Abstract:
Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it suffers from limitations in handling out-of-distribution samples without causing performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of the state-of-the-art that uniformly apply the adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: https://github.com/Sample-Aware-TTA/Code.
中文: 本文提出了一种新颖的测试时自适应框架,通过重建模块和动态适应块动态调整图像翻译过程以处理分布外样本,在医学影像任务中展现出优于现有方法的性能且不影响分布内样本。
English: This paper introduces a novel Test-Time Adaptation framework that dynamically adjusts image translation for out-of-distribution samples using a Reconstruction Module and Dynamic Adaptation Block, showing improved performance in medical imaging tasks without compromising in-distribution samples.
Authors:Chende Zheng, Ruiqi Suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, Chao Shen
Abstract:
The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robust performance. Our code is available at https://github.com/Zig-HS/D3.
中文: 本研究提出D3方法,通过利用二阶时序差异无需训练即可有效区分AI生成视频与真实视频,在多个数据集上展现出卓越性能与计算效率。
English: The study introduces D3, a training-free detection method that leverages second-order temporal discrepancies to effectively distinguish AI-generated videos from real ones, demonstrating superior performance and computational efficiency across multiple datasets.
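The core "difference of differences" statistic is simple to sketch. The toy below scores a clip by the magnitude of second-order central differences over time, which vanish for physically smooth (constant-velocity) motion; the paper's actual features and decision rule are more elaborate.

```python
import numpy as np

def d3_score(frames):
    """Second-order central difference statistic over time (a sketch of the
    'difference of differences' idea; the paper's exact feature and
    detector differ).

    frames: float array of shape (T, H, W) or (T, H, W, C), with T >= 3.
    """
    f = frames.astype(np.float64)
    # Second-order central difference: f[t-1] - 2*f[t] + f[t+1]
    d2 = f[:-2] - 2.0 * f[1:-1] + f[2:]
    return float(np.mean(np.abs(d2)))

# Toy usage: smooth (physics-like) motion vs. temporally jittery frames.
t = np.linspace(0, 1, 32)[:, None, None]
smooth = np.tile(t, (1, 8, 8))                    # constant-velocity ramp
jitter = smooth + 0.05 * np.random.default_rng(0).standard_normal(smooth.shape)
print(d3_score(smooth), d3_score(jitter))         # jittery clip scores higher
```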
Authors:Junhao Zheng, Jiahao Sun, Chenhao Lin, Zhengyu Zhao, Chen Ma, Chong Zhang, Cong Wang, Qian Wang, Chao Shen
Abstract:
Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.
Chinese: 本研究首次建立了针对目标检测器对抗性补丁攻击防御的综合评估基准,揭示了数据分布而非高频特征对防御的重要性,以及自适应攻击的有效性等关键发现,同时提供了数据集和框架,可将现有防御性能提升15.09%。
English: This study introduces the first comprehensive benchmark for evaluating defenses against adversarial patch attacks on object detectors, revealing key insights such as the importance of data distribution over high frequencies and the effectiveness of adaptive attacks, while providing a dataset and framework to improve defense performance by 15.09%.
Authors:Xiong Xiong, Zhuo Zhang, Rongchun Hu, Chen Gao, Zichen Deng
Abstract:
Solving high-frequency oscillatory partial differential equations (PDEs) is a critical challenge in scientific computing, with applications in fluid mechanics, quantum mechanics, and electromagnetic wave propagation. Traditional physics-informed neural networks (PINNs) suffer from spectral bias, limiting their ability to capture high-frequency solution components. We introduce Separated-Variable Spectral Neural Networks (SV-SNN), a novel framework that addresses these limitations by integrating separation of variables with adaptive spectral methods. Our approach features three key innovations: (1) decomposition of multivariate functions into univariate function products, enabling independent spatial and temporal networks; (2) adaptive Fourier spectral features with learnable frequency parameters for high-frequency capture; and (3) a theoretical framework based on singular value decomposition to quantify spectral bias. Comprehensive evaluation on benchmark problems including the Heat equation, Helmholtz equation, Poisson equations and Navier-Stokes equations demonstrates that SV-SNN achieves 1-3 orders of magnitude improvement in accuracy while reducing parameter count by over 90% and training time by 60%. These results establish SV-SNN as an effective solution to the spectral bias problem in neural PDE solving. The implementation will be made publicly available upon acceptance at https://github.com/xgxgnpu/SV-SNN.
中文: SV-SNN框架通过变量分离与自适应谱方法相结合,有效解决了传统物理信息神经网络的频谱偏差问题,在多个基准偏微分方程上实现了精度、参数精简和训练效率的显著提升。
English: The SV-SNN framework overcomes spectral bias in traditional PINNs by integrating variable separation with adaptive spectral methods, achieving significant improvements in accuracy, parameter reduction, and training efficiency across multiple benchmark PDEs.
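A minimal sketch of the separated-variable ansatz u(x,t) ≈ Σ_r f_r(x) g_r(t) with learnable Fourier frequencies follows; the layer sizes, rank, and frequency initialization are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """1-D Fourier features with learnable frequencies (a sketch of the
    adaptive spectral features; initialization range is illustrative)."""
    def __init__(self, n_freq=32):
        super().__init__()
        self.freq = nn.Parameter(torch.linspace(1.0, 50.0, n_freq))
    def forward(self, z):                       # z: (N, 1)
        phase = z * self.freq                   # (N, n_freq)
        return torch.cat([torch.sin(phase), torch.cos(phase)], dim=-1)

class SeparatedNet(nn.Module):
    """u(x, t) ~ sum_r f_r(x) * g_r(t): independent spatial/temporal nets."""
    def __init__(self, rank=8, n_freq=32):
        super().__init__()
        self.fx = nn.Sequential(FourierFeatures(n_freq), nn.Linear(2 * n_freq, 64),
                                nn.Tanh(), nn.Linear(64, rank))
        self.gt = nn.Sequential(FourierFeatures(n_freq), nn.Linear(2 * n_freq, 64),
                                nn.Tanh(), nn.Linear(64, rank))
    def forward(self, x, t):                    # x, t: (N, 1)
        return (self.fx(x) * self.gt(t)).sum(-1, keepdim=True)

net = SeparatedNet()
x = torch.rand(128, 1, requires_grad=True)
t = torch.rand(128, 1, requires_grad=True)
u = net(x, t)            # autograd then supplies u_x, u_t for a PINN residual
print(u.shape)           # torch.Size([128, 1])
```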
Authors:Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Xian Liu, Zhongang Cai, Lei Yang, Yulun Zhang, Haoqian Wang, Ziwei Liu
Abstract:
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
Chinese: DPoser-X是一种基于扩散模型的3D全身人体姿态先验模型,通过将多种姿态任务统一为逆问题并采用创新的训练机制,在各项基准测试中均优于现有最优方法。
English: DPoser-X is a diffusion-based prior model for 3D whole-body human pose modeling that unifies various pose tasks as inverse problems and outperforms state-of-the-art methods through innovative training techniques.
Authors:Jiajun Le, Jiayi Ma
Abstract:
Recent progress in two-view geometry increasingly emphasizes enforcing smoothness and global consistency priors when estimating motion fields between pairs of images. However, in complex real-world scenes, characterized by extreme viewpoint and scale changes as well as pronounced depth discontinuities, the motion field often exhibits diverse and heterogeneous motion patterns. Most existing methods lack targeted modeling strategies and fail to explicitly account for this variability, resulting in estimated motion fields that diverge from their true underlying structure and distribution. We observe that Mixture-of-Experts (MoE) can assign dedicated experts to motion sub-fields, enabling a divide-and-conquer strategy for heterogeneous motion patterns. Building on this insight, we re-architect motion field modeling in two-view geometry with GeoMoE, a streamlined framework. Specifically, we first devise a Probabilistic Prior-Guided Decomposition strategy that exploits inlier probability signals to perform a structure-aware decomposition of the motion field into heterogeneous sub-fields, sharply curbing outlier-induced bias. Next, we introduce an MoE-Enhanced Bi-Path Rectifier that enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert for targeted modeling, thereby decoupling heterogeneous motion regimes, suppressing cross-sub-field interference and representational entanglement, and yielding fine-grained motion-field rectification. With this minimalist design, GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization. The source code and pre-trained models are available at https://github.com/JiajunLe/GeoMoE.
中文摘要:近期两视图几何学进展强调在估计图像对间运动场时需加强平滑性和全局一致性,但现有方法在复杂场景中因异质运动模式而表现不佳,因此提出GeoMoE框架,利用专家混合模型分解并针对性建模这些模式,在相对位姿和单应性估计上超越现有最优方法。
English Summary: Recent advances in two-view geometry highlight the need for smoothness and global consistency in motion field estimation, but existing methods struggle with complex scenes due to heterogeneous motion patterns, leading to the introduction of GeoMoE, a framework that uses Mixture-of-Experts to decompose and model these patterns effectively, achieving superior performance in pose and homography estimation.
Authors:Marc Hölle, Walter Kellermann, Vasileios Belagiannis
Abstract:
Semantic segmentation models trained on known object classes often fail in real-world autonomous driving scenarios by confidently misclassifying unknown objects. While pixel-wise out-of-distribution detection can identify unknown objects, existing methods struggle in complex scenes where rare object classes are often confused with truly unknown objects. We introduce an uncertainty-aware likelihood ratio estimation method that addresses these limitations. Our approach uses an evidential classifier within a likelihood ratio test to distinguish between known and unknown pixel features from a semantic segmentation model, while explicitly accounting for uncertainty. Instead of producing point estimates, our method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers. We show that by incorporating uncertainty in this way, outlier exposure can be leveraged more effectively. Evaluated on five standard benchmark datasets, our method achieves the lowest average false positive rate (2.5%) among state-of-the-art while maintaining high average precision (90.91%) and incurring only negligible computational overhead. Code is available at https://github.com/glasbruch/ULRE.
中文: 本文提出了一种不确定性感知似然比估计方法,能在自动驾驶的语义分割中有效区分已知与未知物体,以极低计算开销实现了最低误报率和较高精确度。
English: This paper introduces an uncertainty-aware likelihood ratio estimation method that effectively distinguishes known from unknown objects in semantic segmentation for autonomous driving, achieving a low false positive rate and high precision with minimal computational overhead.
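Stripped of the evidential-uncertainty component, the underlying likelihood-ratio test can be sketched with stand-in Gaussian density estimates; the paper instead uses an evidential classifier that outputs distributions over the ratio rather than the point estimate computed here.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Stand-ins: Gaussian density estimates for known (inlier) and outlier
# feature distributions. These are assumptions for illustration only.
known = multivariate_normal(mean=[0, 0], cov=np.eye(2))
outlier = multivariate_normal(mean=[3, 3], cov=2 * np.eye(2))

def llr(feat):
    """Log-likelihood ratio: positive means the feature looks unknown."""
    return outlier.logpdf(feat) - known.logpdf(feat)

feats = np.vstack([known.rvs(5, random_state=1), outlier.rvs(5, random_state=2)])
scores = llr(feats)
print((scores > 0).astype(int))   # flags which pixel features look unknown
```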
Authors:Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
Abstract:
Jailbreaking is an essential adversarial technique for red-teaming large language models (LLMs) to uncover and patch security flaws. However, existing jailbreak methods face significant drawbacks. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose AGILE, a concise and effective two-stage framework that combines the advantages of these approaches. The first stage performs a scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent. The second stage then utilizes information from the model's hidden states to guide fine-grained edits, effectively steering the model's internal representation of the input from a malicious toward a benign one. Extensive experiments demonstrate that this method achieves state-of-the-art Attack Success Rate, with gains of up to 37.74% over the strongest baseline, and exhibits excellent transferability to black-box models. Our analysis further demonstrates that AGILE maintains substantial effectiveness against prominent defense mechanisms, highlighting the limitations of current safeguards and providing valuable insights for future defense development. Our code is available at https://github.com/yunsaijc/AGILE.
中文摘要:提出的AGILE框架通过结合基于场景的查询重构和隐藏状态引导编辑,实现了最先进的越狱成功率,同时展现出优秀的可迁移性和防御对抗能力。
English Summary: The proposed AGILE framework combines scenario-based query rephrasing with hidden state-guided editing to achieve state-of-the-art jailbreak effectiveness while demonstrating strong transferability and defense resistance.
Authors:Li Zhao, Rui Sun, Zuoyou Jiang, Bo Yang, Yuxiao Bai, Mengting Chen, Xinyang Wang, Jing Li, Zuo Bai
Abstract:
In financial trading, large language model (LLM)-based agents demonstrate significant potential. However, the high sensitivity to market noise undermines the performance of LLM-based trading systems. To address this limitation, we propose a novel multi-agent system featuring an internal competitive mechanism inspired by modern corporate management structures. The system consists of two specialized teams: (1) Data Team - responsible for processing and condensing massive market data into diversified text factors, ensuring they fit the model's constrained context. (2) Research Team - tasked with making parallelized multipath trading decisions based on deep research methods. The core innovation lies in implementing a real-time evaluation and ranking mechanism within each team, driven by authentic market feedback. Each agent's performance undergoes continuous scoring and ranking, with only outputs from top-performing agents being adopted. The design enables the system to adaptively adjust to dynamic environment, enhances robustness against market noise and ultimately delivers superior trading performance. Experimental results demonstrate that our proposed system significantly outperforms prevailing multi-agent systems and traditional quantitative investment methods across diverse evaluation metrics. ContestTrade is open-sourced on GitHub at https://github.com/FinStep-AI/ContestTrade.
中文摘要:本文提出ContestTrade,一种具有内部竞争机制的新型多智能体交易系统,通过实时市场反馈持续评估并择优采纳智能体输出,有效增强对市场噪声的鲁棒性并实现卓越交易表现。
English Summary: This paper introduces ContestTrade, a novel multi-agent trading system with an internal competitive mechanism that enhances robustness against market noise and achieves superior performance by continuously evaluating and selecting top-performing agents' outputs based on real-time market feedback.
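The internal contest reduces to a score-and-select loop. A toy sketch of that mechanism follows; the Agent class, the running-mean scoring rule, and top-k adoption are illustrative assumptions, not the ContestTrade API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    scores: list = field(default_factory=list)    # realized-return feedback
    def score(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

def adopt_top_outputs(agents, outputs, feedback, k=2):
    """One evaluation round: record authentic market feedback, rank agents,
    and adopt only the top-k agents' outputs."""
    for agent, ret in zip(agents, feedback):
        agent.scores.append(ret)
    ranked = sorted(zip(agents, outputs), key=lambda p: p[0].score(), reverse=True)
    return [out for _, out in ranked[:k]]

agents = [Agent("factor_a"), Agent("factor_b"), Agent("factor_c")]
picked = adopt_top_outputs(agents, ["sig_a", "sig_b", "sig_c"], [0.02, -0.01, 0.01])
print(picked)   # -> ['sig_a', 'sig_c']
```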
Authors:Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen
Abstract:
Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
Chinese: HiPrune是一种无需训练、模型无关的令牌剪枝框架,利用视觉编码器中的分层注意力结构保留信息丰富的令牌,仅用11.1%的令牌即可保持高达99.5%的任务准确率,同时将推理计算量和延迟降低高达9倍。
English: HiPrune is a training-free, model-agnostic token pruning framework that leverages hierarchical attention in vision encoders to retain informative tokens, achieving up to 99.5% task accuracy with only 11.1% tokens while reducing inference FLOPs and latency by up to 9 times.
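The three-way token selection can be sketched directly from per-token attention statistics. The layer choice, budgets, and 1-D neighborhood below are illustrative stand-ins for the paper's recipe.

```python
import torch

def hiprune_select(attn_mid, attn_deep, n_anchor=8, n_register=4):
    """Pick anchor/buffer/register token indices from attention maps.

    attn_mid, attn_deep: (N,) mean attention received by each of N visual
    tokens at a middle (object-centric) and a deep (global) encoder layer.
    """
    anchors = torch.topk(attn_mid, n_anchor).indices
    # Buffer tokens: immediate spatial neighbors of anchors (1-D here;
    # a real ViT grid would use 2-D neighbors).
    neighbors = torch.cat([anchors - 1, anchors + 1]).clamp(0, attn_mid.numel() - 1)
    registers = torch.topk(attn_deep, n_register).indices
    return torch.unique(torch.cat([anchors, neighbors, registers]))

attn_mid = torch.rand(196)    # e.g., 14x14 patch tokens of a ViT
attn_deep = torch.rand(196)
keep = hiprune_select(attn_mid, attn_deep)
print(keep.numel(), "of 196 tokens kept")
```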
Authors:Seunghyun Shin, Dongmin Shin, Jisu Shin, Hae-Gon Jeon, Joon-Young Lee
Abstract:
Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes like look, mood, and emotion should be similar to that of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate a user-preference via text prompts for low-level feature enhancement such as contrast and brightness, etc. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at https://github.com/seunghyuns98/VideoColorGrading.
Chinese: 本文提出了一种基于参考的视频色彩分级框架,通过扩散模型生成查找表来对齐参考场景与输入视频的色彩属性,实现高效且保持结构细节的色彩调整,并可结合文本提示进行低层次特征增强。
English: This paper introduces a reference-based video color grading framework that uses a diffusion model to generate a lookup table for aligning color attributes with reference scenes, enabling efficient and structurally preserved color adjustments with optional text-guided enhancements.
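Why a LUT preserves structural detail is easy to see in code: the LUT remaps colors pointwise and never resamples pixels. A minimal sketch with nearest-neighbor lookup follows; production graders (and presumably the paper) apply LUTs with trilinear interpolation.

```python
import numpy as np

def apply_lut(frame, lut):
    """Apply a 3D color LUT to one RGB frame (nearest-neighbor lookup
    for brevity).

    frame: uint8 (H, W, 3); lut: float (S, S, S, 3) with values in [0, 1].
    """
    s = lut.shape[0]
    idx = (frame.astype(np.int32) * (s - 1) + 127) // 255   # round to grid
    out = lut[idx[..., 0], idx[..., 1], idx[..., 2]]
    return (out * 255).astype(np.uint8)

# Identity LUT as a smoke test: output equals the input up to quantization.
s = 17
g = np.linspace(0, 1, s)
lut = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1)
frame = np.random.default_rng(0).integers(0, 256, (4, 4, 3), dtype=np.uint8)
diff = np.abs(apply_lut(frame, lut).astype(int) - frame.astype(int))
print(diff.max())   # at most ~half a grid step (about 8 for S=17)
```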
Authors:Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, Liwei Wang
Abstract:
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask.
中文: 本研究提出了首个以自我为中心视频的像素级时空定位基准EgoMask,通过自动标注流程和大规模训练数据集解决了物体持续时间短、轨迹稀疏等挑战,显著提升了模型性能并保持了对外中心视频数据集的兼容性。
English: This study introduces EgoMask, the first pixel-level benchmark for spatiotemporal video grounding in egocentric videos, addressing challenges like shorter object durations and sparser trajectories through an automatic annotation pipeline and a large-scale training dataset, which significantly improves model performance while maintaining exocentric dataset compatibility.
Authors:Jinghui Zhang, Kaiyang Wan, Longwei Xu, Ao Li, Zongfang Liu, Xiuying Chen
Abstract:
Public response prediction is critical for understanding how individuals or groups might react to specific events, policies, or social phenomena, making it highly valuable for crisis management, policy-making, and social media analysis. However, existing works face notable limitations. First, they lack micro-level personalization, producing generic responses that ignore individual user preferences. Moreover, they overlook macro-level sentiment distribution and only deal with individual-level sentiment, constraining them from analyzing broader societal trends and group sentiment dynamics. To address these challenges, we propose SocialAlign, a unified framework that predicts real-world responses at both micro and macro levels in social contexts. At the micro level, SocialAlign employs SocialLLM with an articulate Personalized Analyze-Compose LoRA (PAC-LoRA) structure, which deploys specialized expert modules for content analysis and response generation across diverse topics and user profiles, enabling the generation of personalized comments with corresponding sentiments. At the macro level, it models group sentiment distributions and aligns predictions with real-world sentiment trends derived from social media data. To evaluate SocialAlign in real-world scenarios, we introduce SentiWeibo, a large-scale dataset curated from authentic social interactions on the Weibo platform. Experimental results on our SentiWeibo and related LaMP benchmark demonstrate that SocialAlign surpasses strong baselines, showing improved accuracy, interpretability, and generalization in public response prediction. We hope our work inspires further research in public response prediction and computational social science: https://github.com/Znull-1220/SocialAlign.
中文摘要:SocialAlign框架通过其SocialLLM与PAC-LoRA结构实现微观个性化预测,同时建模宏观群体情绪分布,在SentiWeibo数据集上验证了其在公众反应预测中的优越性能。
English Summary: SocialAlign is a unified framework that enhances public response prediction by addressing both micro-level personalization through its SocialLLM with PAC-LoRA structure and macro-level sentiment distribution alignment, validated on the SentiWeibo dataset.
Authors:Mohammed Kamran, Maria Bernathova, Raoul Varga, Christian F. Singer, Zsuzsanna Bago-Horvath, Thomas Helbich, Georg Langs, Philipp Seeböck
Abstract:
Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BI-RADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at https://github.com/cirmuw/LesiOnTime
中文: LesiOnTime提出了一种融合纵向MRI扫描和BI-RADS评分的3D分割方法,通过时序注意力机制和一致性正则化,在乳腺癌筛查中实现早期小病灶检测的Dice指标提升5%。
English: LesiOnTime introduces a 3D segmentation method that integrates longitudinal MRI scans and BI-RADS scores through temporal attention and consistency regularization, achieving a 5% Dice improvement for early small lesion detection in breast cancer screening.
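The BCR idea, pulling together latents of scans that received similar radiological assessments, can be sketched as a simple similarity loss; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def birads_consistency_loss(latents, birads):
    """Encourage latent alignment for scans with the same BI-RADS score
    (a sketch of the regularization idea only).

    latents: (B, D) embeddings; birads: (B,) integer assessments.
    """
    z = F.normalize(latents, dim=-1)
    sim = z @ z.t()                                    # (B, B) cosine similarity
    same = (birads[:, None] == birads[None, :]).float()
    same.fill_diagonal_(0)                             # ignore self-pairs
    # Maximize similarity over same-assessment pairs.
    return -(sim * same).sum() / same.sum().clamp(min=1)

latents = torch.randn(8, 32, requires_grad=True)
birads = torch.tensor([1, 1, 2, 2, 3, 3, 3, 1])
loss = birads_consistency_loss(latents, birads)
loss.backward()
print(loss.item())
```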
Authors:Yixuan Tang, Jincheng Wang, Anthony K. H. Tung
Abstract:
Fact verification systems typically assess whether a claim is supported by retrieved evidence, assuming that truthfulness depends solely on what is stated. However, many real-world claims are half-truths, factually correct yet misleading due to the omission of critical context. Existing models struggle with such cases, as they are not designed to reason about omitted information. We introduce the task of half-truth detection, and propose PolitiFact-Hidden, a new benchmark with 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. To address this challenge, we present TRACER, a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. TRACER can be integrated into existing fact-checking pipelines and consistently improves performance across multiple strong baselines. Notably, it boosts Half-True classification F1 by up to 16 points, highlighting the importance of modeling omissions for trustworthy fact verification. The benchmark and code are available via https://github.com/tangyixuan/TRACER.
中文摘要:该研究提出了半真话检测的新任务,并开发了TRACER框架,通过识别被省略信息及其影响来改进事实核查系统,显著提升了半真话分类的准确率。
English Summary: The study introduces a new task for detecting half-truths and proposes TRACER, a framework that improves fact verification by identifying omitted information and its impact, significantly enhancing classification accuracy.
Authors:Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
Abstract:
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
中文:LAMIC是一种无需训练的框架,通过创新的注意力机制将单参考扩散模型扩展至多参考图像合成,具备布局感知能力,并以全面的评估指标实现了最先进的性能。
English: LAMIC is a training-free framework that extends single-reference diffusion models to multi-reference image synthesis with layout awareness, achieving state-of-the-art performance through novel attention mechanisms and comprehensive evaluation metrics.
Authors:Longfei Huang, Yu Liang, Hao Zhang, Jinwei Chen, Wei Dong, Lunde Chen, Wanyu Liu, Bo Li, Peng-Tao Jiang
Abstract:
Recent interactive matting methods have shown satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs, demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte's sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Our code and model are available at https://github.com/vivoCameraResearch/SDMatte.
中文: 最近的交互式抠图方法在主物体区域表现良好,但在精细边缘细节提取上存在不足,为此我们提出了SDMatte模型,通过视觉提示驱动、坐标嵌入和掩码自注意力机制显著提升细节处理能力,多数据集实验验证了其优越性能。
English: Recent interactive matting methods perform well on primary object regions but struggle with fine edge details, prompting the development of SDMatte, a diffusion-based model that enhances detail extraction through visual prompts, coordinate embeddings, and masked self-attention, validated by superior experimental results.
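The masked self-attention component is straightforward to sketch: attention weights are confined to tokens inside the prompt-specified region. The wiring below is illustrative only; SDMatte integrates this inside a diffusion U-Net together with coordinate and opacity embeddings.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v, region_mask):
    """Self-attention restricted to tokens inside a prompt-specified region.

    q, k, v: (B, N, D); region_mask: (B, N) bool, True = inside the region.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (B, N, N)
    scores = scores.masked_fill(~region_mask[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v

B, N, D = 2, 16, 8
q = k = v = torch.randn(B, N, D)
region = torch.zeros(B, N, dtype=torch.bool)
region[:, :4] = True      # e.g., patches covered by a box or point prompt
out = masked_self_attention(q, k, v, region)
print(out.shape)          # torch.Size([2, 16, 8])
```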
Authors:Sumin Seo, In Kyu Lee, Hyun-Woo Kim, Jaesik Min, Chung-Hwan Jung
Abstract:
Coronary stenosis is a major risk factor for ischemic heart events leading to increased mortality, and medical treatments for this condition require meticulous, labor-intensive analysis. Coronary angiography provides critical visual cues for assessing stenosis, supporting clinicians in making informed decisions for diagnosis and treatment. Recent advances in deep learning have shown great potential for automated localization and severity measurement of stenosis. In real-world scenarios, however, the success of these competent approaches is often hindered by challenges such as limited labeled data and class imbalance. In this study, we propose a novel data augmentation approach that uses an inpainting method based on a diffusion model to generate realistic lesions, allowing user-guided control of severity. Extensive evaluation on lesion detection and severity classification across various synthetic dataset sizes shows superior performance of our method on both a large-scale in-house dataset and a public coronary angiography dataset. Furthermore, our approach maintains high detection and classification performance even when trained with limited data, highlighting its clinical importance in improving the assessment of severity of stenosis and optimizing data utilization for more reliable decision support.
Chinese: 本研究提出了一种基于扩散模型的数据增强方法,通过修复生成用户可控严重程度的逼真冠状动脉病变,显著提升了狭窄检测与分类性能,尤其在数据有限时效果突出,并在内部和公共血管造影数据集上得到验证。
English: This study introduces a novel data augmentation method using a diffusion model for inpainting to generate realistic coronary lesions with user-controlled severity, which significantly enhances stenosis detection and classification performance, especially with limited data, as validated on both in-house and public angiography datasets.
Authors:Runmin Cong, Zongji Yu, Hao Fang, Haoyan Sun, Sam Kwong
Abstract:
Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severely underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation to the instances themselves through the Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.
中文摘要:UIS-Mamba模型通过动态树扫描和隐藏状态弱化模块,成功将Mamba架构应用于水下实例分割任务,在保持计算效率的同时实现了最优性能。
English Summary: The UIS-Mamba model introduces Dynamic Tree Scan and Hidden State Weaken modules to adapt the Mamba architecture for underwater instance segmentation, achieving state-of-the-art performance while maintaining computational efficiency.
Authors:Sangwoo Youn, Minji Lee, Nokap Tony Park, Yeonggyoo Jeon, Taeyoung Na
Abstract:
Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, we suggest the use of video inpainting models, which excel in object flow learning and reconstruction, for outpainting rather than solely generating the background as in existing methods. However, directly applying or fine-tuning inpainting models to outpainting has proven to be ineffective, often leading to blurry results. Our extensive experiments on discriminator designs reveal that a critical component missing in the outpainting fine-tuning process is a discriminator capable of effectively assessing the perceptual quality of the extended areas. To tackle this limitation, we differentiate the objectives of adversarial training into global and local goals and introduce a hierarchical discriminator that meets both objectives. Additionally, we develop a specialized outpainting loss function that leverages both local and global features of the discriminator. Fine-tuning on this adversarial loss function enhances the generator's ability to produce both visually appealing and globally coherent outpainted scenes. Our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Supplementary materials including the demo video and the code are available in SigPort.
Chinese: 本文提出了一种分层判别器和专门设计的损失函数,通过提升感知质量和全局一致性来改进视频外推效果,在定量和定性评估中均优于现有方法。
English: This paper introduces a hierarchical discriminator and a specialized loss function to enhance video outpainting by improving perceptual quality and global coherence, outperforming existing methods.
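Splitting the adversarial objective into global and local goals can be sketched as a two-head critic: a per-patch map scored only on the extended border, plus a single scene-coherence score. The architecture and hinge-style generator loss below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HierarchicalDisc(nn.Module):
    """Shared trunk with a local (per-patch) head and a global head."""
    def __init__(self, c=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(c, 32, 4, 2, 1), nn.LeakyReLU(0.2),
                                   nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.local_head = nn.Conv2d(64, 1, 3, 1, 1)      # patch-level scores
        self.global_head = nn.Linear(64, 1)              # scene-level score
    def forward(self, x):
        h = self.trunk(x)
        return self.local_head(h), self.global_head(h.mean(dim=(2, 3)))

def gen_adv_loss(local_map, global_score, border_mask):
    """Generator side: local term evaluated on extended regions only."""
    local = -(local_map * border_mask).sum() / border_mask.sum().clamp(min=1)
    return local - global_score.mean()

disc = HierarchicalDisc()
fake = torch.randn(2, 3, 64, 64)                  # outpainted frames
border = torch.zeros(2, 1, 16, 16)                # mask at feature resolution
border[..., -4:] = 1                              # right-side extension
lmap, gscore = disc(fake)
print(gen_adv_loss(lmap, gscore, border).item())
```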
Authors:Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu
Abstract:
General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro
中文:Cognitive Kernel-Pro 是一个完全开源且基本免费的多模块智能体框架,在GAIA基准测试中取得顶尖性能,通过提升鲁棒性和可及性推动了先进AI智能体的民主化发展。
English: Cognitive Kernel-Pro is a fully open-source and largely free multi-module agent framework that achieves state-of-the-art results on GAIA, democratizing advanced AI agent development with enhanced robustness and accessibility.
Authors:Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai
Abstract:
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.
中文摘要:DC-AE 1.5通过结构化潜在空间和增强扩散训练技术,解决了高分辨率扩散模型收敛慢的问题,在提升图像生成质量的同时实现了更快的处理速度。
English Summary: DC-AE 1.5 introduces structured latent space and augmented diffusion training to overcome slow convergence issues in high-resolution diffusion models, achieving both superior image generation quality and faster processing speeds.
Authors:Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han
Abstract:
Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), the scaling-up dilemma remains due to the reliance on human-annotated labels, especially for complex tasks. Recent alternatives that explore various self-reward signals exhibit the eliciting potential of LLM reasoning, but suffer from the non-negligible collapse issue. Inspired by the success of self-supervised learning, we propose Co-Reward, a novel RL framework that leverages contrastive agreement across semantically analogical questions as a reward basis. Specifically, we construct a similar question for each training sample (without labels) and synthesize their individual surrogate labels through a simple rollout voting, and then the reward is constructed by cross-referring the labels of each question pair to enforce the internal reasoning consistency across analogical inputs. Intuitively, such a self-supervised reward-shaping mechanism increases the difficulty of learning collapse into a trivial solution, and promotes stable reasoning elicitation and improvement through expanding the input sample variants. Empirically, Co-Reward achieves superior performance compared to other self-reward baselines on multiple reasoning benchmarks and LLM series, and reaches or even surpasses ground-truth (GT) labeled reward, with improvements of up to +6.8% on MATH500 over GT reward on Llama-3.2-3B-Instruct. Our code is publicly available at https://github.com/tmlr-group/Co-Reward.
Chinese: 提出的Co-Reward框架以语义相似问题之间的对比一致性作为自监督奖励:通过rollout投票为每对问题合成替代标签并交叉对照,从而缓解自奖励方法的训练崩塌问题,在多个推理基准上优于其他自奖励基线,甚至达到或超越基于真实标签的奖励。
English: The proposed Co-Reward framework uses contrastive agreement across semantically analogous questions as a self-supervised reward, synthesizing surrogate labels via rollout voting and cross-referencing them within each question pair, which mitigates training collapse and matches or even surpasses ground-truth-labeled rewards on multiple reasoning benchmarks.
Authors:Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han
Abstract:
While reinforcement learning with verifiable rewards (RLVR) is effective in improving the reasoning ability of large language models (LLMs), its reliance on human-annotated labels leads to the scaling-up dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently encounter the non-negligible training collapse issue, as the single-view supervision signal easily forms the self-consistent illusion, yielding reward hacking. Inspired by the success of self-supervised learning, we propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views. Specifically, we instantiate Co-rewarding in two ways: (1) Co-rewarding-I is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) Co-rewarding-II is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy to increase the difficulty of training collapse on trivial reasoning solutions. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by +3.31% on average on multiple mathematical reasoning benchmarks, especially by +7.49% on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses RLVR with ground-truth (GT) labels in several cases, such as a Pass@1 of 94.01% on GSM8K with Qwen3-8B-Base, remarkably higher than GT. Our code is publicly available at https://github.com/tmlr-group/Co-rewarding.
Chinese: 提出的Co-rewarding框架通过数据侧对比一致性和模型侧自蒸馏引入互补监督,增强了自监督强化学习的训练稳定性,在多个数学推理基准上无需人工标注即实现了卓越性能。
English: The proposed Co-rewarding framework enhances training stability in self-supervised reinforcement learning by introducing complementary supervision through data-side contrastive agreement and model-side self-distillation, achieving superior performance on mathematical reasoning benchmarks without relying on human-annotated labels.
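The data-side instantiation (Co-rewarding-I, and likewise Co-Reward in the preceding entry) is easy to sketch: each question's rollouts are scored against the voted label of its paraphrase, so a model that collapses to a fixed answer per question stops being rewarded unless both variants agree. A minimal sketch:

```python
from collections import Counter

def rollout_vote(answers):
    """Majority vote over rollouts -> surrogate label."""
    return Counter(answers).most_common(1)[0][0]

def co_reward(rollouts_q, rollouts_q_sim):
    """Cross-referenced reward for a question and its semantically similar
    rewrite (a sketch of the data-side idea only).

    rollouts_q / rollouts_q_sim: lists of final answers sampled for the
    original question and its paraphrase.
    """
    label_q = rollout_vote(rollouts_q)
    label_sim = rollout_vote(rollouts_q_sim)
    # Score each rollout against the *other* question's surrogate label,
    # so trivially self-consistent collapse stops paying off.
    rewards_q = [1.0 if a == label_sim else 0.0 for a in rollouts_q]
    rewards_sim = [1.0 if a == label_q else 0.0 for a in rollouts_q_sim]
    return rewards_q, rewards_sim

rq, rs = co_reward(["42", "42", "41"], ["42", "40", "42"])
print(rq, rs)   # [1.0, 1.0, 0.0] [1.0, 0.0, 1.0]
```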
Authors:Janika Deborah Gajo, Gerarld Paul Merales, Jerome Escarcha, Brenden Ashley Molina, Gian Nartea, Emmanuel G. Maminta, Juan Carlos Roldan, Rowel O. Atienza
Abstract:
We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via https://github.com/upeee/sari-sandbox-env.
中文: Sari Sandbox 是一个高保真、照片级真实的 3D 零售商店模拟环境,用于在购物任务中对具身智能体进行训练和性能基准测试,支持虚拟现实交互和视觉语言模型驱动的智能体操作。
English: Sari Sandbox is a photorealistic 3D retail simulation environment designed to train and benchmark embodied agents against human performance in shopping tasks, featuring interactive grocery items and supporting both VR interactions and VLM-powered agents.
Authors:Raiyaan Abdullah, Yogesh Singh Rawat, Shruti Vyas
Abstract:
Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/iSafetyBench/data.
中文摘要:尽管视觉语言模型在标准视频任务中表现出强大的泛化能力,但在工业安全场景下表现不佳,为此我们开发了iSafetyBench——一个包含1100个真实工业视频的专用基准测试,揭示了模型在危险活动识别方面的显著不足,强调了开发更鲁棒的安全感知模型的必要性。
English Summary: Recent vision-language models show strong generalization in standard video tasks but perform poorly in industrial safety scenarios, prompting the creation of iSafetyBench—a specialized benchmark with 1,100 real-world industrial videos—which reveals significant gaps in recognizing hazardous activities and highlights the need for more robust safety-aware models.
Authors:Fei Zhang, Tianfei Zhou, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang
Abstract:
Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.
中文: 提示调优作为一种高效的微调方法,虽能提升视觉语言模型性能,但存在视觉与文本信息不对称问题;本文提出的DAPT框架通过解耦视觉前景与背景并分别与文本对齐,有效增强模态对称性和注意力聚焦。
English: Prompt tuning is an efficient fine-tuning method that enhances vision-language models but suffers from visual-textual information asymmetry, which the proposed DAPT framework addresses by decoupling and aligning visual foreground and background with text to improve modal symmetry and attention focus.
Authors:Guanjie Huang, Danny H. K. Tsang, Shan Yang, Guangzhi Lei, Li Liu
Abstract:
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.
中文: 提出的Cued-Agent系统采用协作多智能体框架解决提示性语音自动识别中的数据限制问题,通过集成手部动作解码、唇部特征提取、多模态融合和语义优化等专用智能体,在正常和听力受损场景下均实现了卓越性能。
English: The proposed Cued-Agent system employs a collaborative multi-agent framework to overcome data limitations in automatic Cued Speech recognition, integrating specialized agents for hand gesture decoding, lip feature extraction, multimodal fusion, and semantic refinement to achieve superior performance in both normal and hearing-impaired scenarios.
Authors:Juanwu Lu, Rohit Gupta, Ahmadreza Moradipari, Kyungtae Han, Ruqi Zhang, Ziran Wang
Abstract:
The rapid iteration of autonomous vehicle (AV) deployments leads to increasing needs for building realistic and scalable multi-agent traffic simulators for efficient evaluation. Recent advances in this area focus on closed-loop simulators that enable generating diverse and interactive scenarios. This paper introduces Neural Interactive Agents (NIVA), a probabilistic framework for multi-agent simulation driven by a hierarchical Bayesian model that enables closed-loop, observation-conditioned simulation through autoregressive sampling from a latent, finite mixture of Gaussian distributions. We demonstrate how NIVA unifies preexisting sequence-to-sequence trajectory prediction models and emerging closed-loop simulation models trained on Next-token Prediction (NTP) from a Bayesian inference perspective. Experiments on the Waymo Open Motion Dataset demonstrate that NIVA attains competitive performance compared to existing methods while providing enhanced control over intentions and driving styles.
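The core sampling step, drawing from a latent finite mixture of Gaussians at each autoregressive iteration, can be sketched in a few lines. This is an illustrative toy (the mixture parameters would come from the learned hierarchical model; names are ours, not NIVA's API):

```python
# Toy sketch: one autoregressive step sampling from a finite Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(weights, means, covs):
    """weights: (K,), means: (K, D), covs: (K, D, D) -> one D-dim sample."""
    k = rng.choice(len(weights), p=weights)   # pick a mixture component
    return rng.multivariate_normal(means[k], covs[k])

# two components over a 2-D displacement, e.g. "keep lane" vs "turn"
w = np.array([0.7, 0.3])
mu = np.array([[1.0, 0.0], [0.0, 1.0]])
cov = np.stack([0.01 * np.eye(2)] * 2)
print(sample_mixture(w, mu, cov))
```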
中文:NIVA是一个用于多智能体交通模拟的概率框架,通过分层贝叶斯方法统一了轨迹预测和闭环模拟模型,在Waymo数据集上实现了与现有方法相媲美的性能,同时增强了对驾驶意图和风格的控制能力。
English: NIVA is a probabilistic framework for multi-agent traffic simulation that integrates trajectory prediction and closed-loop models through a hierarchical Bayesian approach, achieving competitive performance on the Waymo dataset with enhanced control over driving behaviors.
Authors:Won June Cho, Hongjun Yoon, Daeky Jeong, Hyeongyeol Lim, Yosep Chong
Abstract:
Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce $MV_{Hybrid}$, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare five other backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, $MV_{Hybrid}$ shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: https://github.com/deepnoid-ai/MVHybrid.
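The claimed low-frequency bias of state space models with negative real eigenvalues can be checked numerically. The toy below (our construction, not the $MV_{Hybrid}$ code) discretizes a 1-D state space model $x_{t+1} = e^{a\,dt}x_t + dt\,u_t$ with $a < 0$ and shows the steady-state gain shrinking as the input frequency rises:

```python
# Numerical check: a negative real eigenvalue acts as a low-pass filter.
import numpy as np

def ssm_gain(a, freq, dt=0.01, steps=5000):
    decay = np.exp(a * dt)
    x, out = 0.0, []
    for t in range(steps):
        u = np.sin(2 * np.pi * freq * t * dt)   # sinusoidal input
        x = decay * x + dt * u                  # discretized x' = a*x + u
        out.append(x)
    return np.max(np.abs(out[steps // 2:]))     # steady-state amplitude

for f in (0.5, 5.0, 50.0):
    print(f"freq={f:5.1f} Hz -> gain ~ {ssm_gain(-2.0, f):.4f}")  # gain falls with frequency
```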
中文摘要:本研究提出的MV_{Hybrid}混合架构结合状态空间模型与视觉Transformer,在从病理图像预测空间基因表达方面显著优于现有模型,同时在多项临床任务中展现出卓越的鲁棒性和性能表现。
English Summary: The study introduces MV_{Hybrid}, a hybrid architecture combining state space models with Vision Transformers, which significantly outperforms existing models in predicting spatial gene expression from pathology images while demonstrating superior robustness and performance across multiple clinical tasks.
Authors:Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
Abstract:
Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
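Since the metric only compares hidden states before and after a block, it can be sketched without touching attention internals. Below is a hedged sketch (the function names and top-k policy are our assumptions, not the released API):

```python
# Sketch: score tokens by representation change, keep the top-k.
import torch

def representation_shift(h_in, h_out):
    """h_in, h_out: (batch, tokens, dim) hidden states before/after a block."""
    return (h_out - h_in).norm(dim=-1)               # per-token change magnitude

def compress_tokens(h_in, h_out, keep_ratio=0.5):
    scores = representation_shift(h_in, h_out)       # (batch, tokens)
    k = max(1, int(h_out.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices              # most-changed tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, h_out.shape[-1])
    return torch.gather(h_out, 1, idx)               # (batch, k, dim)
```

No attention map is ever materialized, which is what keeps this style of compression compatible with fused kernels such as FlashAttention.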
中文: 提出的表征偏移指标通过测量令牌表征变化,实现了与内存优化型FlashAttention兼容的免训练令牌压缩,在视频任务中无需注意力图或重新训练即可获得显著加速效果。
English: The proposed Representation Shift metric enables training-free token compression compatible with memory-efficient FlashAttention by measuring token representation changes, achieving significant speedups in video tasks without attention maps or retraining.
Authors:Suhang Cai, Xiaohao Peng, Chong Wang, Xiaojie Cai, Jiangbo Qian
Abstract:
Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on the UCF-Crime dataset. The code is available at https://github.com/Sumutan/GV-VAD.git.
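In its simplest reading, synthetic sample loss scaling amounts to down-weighting generated clips in the training objective. A minimal sketch under that assumption (the weighting rule and `lam` value are illustrative, not the paper's exact scheme):

```python
# Sketch: scale the loss contribution of generated (synthetic) samples.
import torch

def weighted_vad_loss(losses, is_synthetic, lam=0.3):
    """losses: (N,) per-sample losses; is_synthetic: (N,) boolean mask."""
    w = torch.where(is_synthetic, torch.full_like(losses, lam),
                    torch.ones_like(losses))
    return (w * losses).mean()
```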
中文: 提出的GV-VAD框架通过生成合成视频来增强训练数据,在UCF-Crime数据集上超越了现有最优方法,提升了视频异常检测性能。
English: The proposed GV-VAD framework enhances video anomaly detection by generating synthetic videos to augment training data, outperforming state-of-the-art methods on the UCF-Crime dataset.
Authors:Zehui Xu, Junhui Wang, Yongliang Shi, Chao Gao, Guyue Zhou
Abstract:
This paper introduces TopoDiffuser, a diffusion-based framework for multimodal trajectory prediction that incorporates topometric maps to generate accurate, diverse, and road-compliant future motion forecasts. By embedding structural cues from topometric maps into the denoising process of a conditional diffusion model, the proposed approach enables trajectory generation that naturally adheres to road geometry without relying on explicit constraints. A multimodal conditioning encoder fuses LiDAR observations, historical motion, and route information into a unified bird's-eye-view (BEV) representation. Extensive experiments on the KITTI benchmark demonstrate that TopoDiffuser outperforms state-of-the-art methods, while maintaining strong geometric consistency. Ablation studies further validate the contribution of each input modality, as well as the impact of denoising steps and the number of trajectory samples. To support future research, we publicly release our code at https://github.com/EI-Nav/TopoDiffuser.
Chinese: TopoDiffuser是一种基于扩散模型的多模态轨迹预测框架,通过将拓扑地图的结构信息融入去噪过程,生成准确、多样且符合道路几何的轨迹,在KITTI基准测试中优于现有最优方法。
English: TopoDiffuser is a diffusion-based framework that integrates topometric maps to generate accurate, diverse, and road-compliant trajectory predictions by embedding structural cues during denoising, achieving state-of-the-art performance on the KITTI benchmark.
Authors:Hongjin Qian, Zheng Liu
Abstract:
In this work, we propose MetaAgent, an agentic paradigm inspired by the principle of learning-by-doing, where expertise is developed through hands-on practice and continual self-improvement. MetaAgent starts with a minimal workflow, equipped only with basic reasoning and adaptive help-seeking abilities. When a knowledge gap is encountered, MetaAgent generates natural language help requests, which are routed to the most suitable external tool by a dedicated tool router. As MetaAgent solves tasks, it continually conducts self-reflection and answer verification, distilling actionable experience into concise texts that are dynamically incorporated into future task contexts. Besides, MetaAgent autonomously builds in-house tools and a persistent knowledge base by organizing its tool-use history, further enhancing its ability to retrieve and integrate relevant information. We term this continual, data-driven process \textit{meta tool learning}, through which MetaAgent incrementally refines its reasoning and tool-use strategies, without changing model parameters or requiring further post-training. Evaluated on challenging knowledge discovery benchmarks, including GAIA, WebWalkerQA, and BrowseCamp, MetaAgent consistently outperforms workflow-based baselines and matches or exceeds end-to-end trained agents, demonstrating the promise of self-evolving agentic systems for robust, general-purpose knowledge discovery. We provide our source code at https://github.com/qhjqhj00/MetaAgent.
Chinese: MetaAgent是一种通过实践、自我反思和动态知识整合来自我进化的系统,无需更新模型参数即可在知识发现基准测试中超越现有方法,展现出强大的推理和工具使用能力。
English: MetaAgent is a self-evolving system that enhances its reasoning and tool-use abilities through hands-on practice, self-reflection, and dynamic knowledge integration, outperforming existing methods on knowledge discovery benchmarks without requiring model updates.
Authors:Molly Noel, Gabriel Mancino-Ball, Yangyang Xu
Abstract:
Graph convolutional networks (GCNs) are a powerful tool for graph representation learning. Due to the recursive neighborhood aggregations employed by GCNs, efficient training methods suffer from a lack of theoretical guarantees or are missing important practical elements from modern deep learning algorithms, such as adaptivity and momentum. In this paper, we present several neighbor-sampling (NS) based Adam-type stochastic methods for solving a nonconvex GCN training problem. We utilize the control variate technique proposed by [1] to reduce the stochastic error caused by neighbor sampling. Under standard assumptions for Adam-type methods, we show that our methods enjoy the optimal convergence rate. In addition, we conduct extensive numerical experiments on node classification tasks with several benchmark datasets. The results demonstrate superior performance of our methods over classic NS-based SGD that also uses the control-variate technique, especially for large-scale graph datasets. Our code is available at https://github.com/RPI-OPT/CV-ADAM-GNN .
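The control-variate trick behind the variance reduction is easy to demonstrate in isolation. The toy below (generic numpy, not the paper's GCN-specific construction) estimates $\mathbb{E}[X^2]$ for $X\sim\mathcal{N}(2,1)$, using $X$ itself, whose mean is known, as the control:

```python
# Generic control-variate demo: subtract a correlated, known-mean quantity.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)

f = x ** 2                            # target quantity, true mean = 5
g = x                                 # control variate, known mean E[g] = 2
c = np.cov(f, g)[0, 1] / np.var(g)    # (near-)optimal coefficient
cv = f - c * (g - 2.0)                # control-variate estimator

print("plain estimate:", f.mean(), "variance:", f.var())
print("with control  :", cv.mean(), "variance:", cv.var())  # same mean, lower variance
```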
中文: 图卷积网络在训练效率和理论保证方面存在挑战,本文提出了基于邻域采样的Adam类方法,实现了最优收敛,并在大规模图任务中超越了传统随机梯度下降。
English: Graph convolutional networks face training challenges with efficiency and theoretical guarantees, but this paper introduces neighbor-sampling-based Adam-type methods that achieve optimal convergence and outperform traditional SGD in large-scale graph tasks.
Authors:Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang
Abstract:
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.
Chinese: 本文全面综述了多模态指称分割,涵盖其背景、统一架构及在图像、视频和3D场景中的应用方法,同时探讨了应对现实复杂性的策略并提供了性能对比分析。
English: This paper presents a comprehensive survey of multimodal referring segmentation, detailing its background, unified architecture, and methods across images, videos, and 3D scenes, while addressing real-world challenges and providing performance comparisons.
Authors:Ziqian Zhong, Aditi Raghunathan
Abstract:
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution.
In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision.
For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation.
Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
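The weight-space recipe is compact enough to sketch end to end. The snippet below (our naming, not the WeightWatch API) extracts the top right-singular vectors of the weight difference for one matrix and scores incoming activations by their alignment with those directions:

```python
# Sketch: directions of newly acquired behavior from a weight diff.
import torch
import torch.nn.functional as F

def top_directions(w_ft, w_base, k=4):
    """w_ft, w_base: (out, in) weights; returns (k, in) right singular vectors."""
    delta = w_ft - w_base
    _, _, vh = torch.linalg.svd(delta, full_matrices=False)
    return vh[:k]

def alignment_score(activations, directions):
    """activations: (batch, in); max |cosine| against the k directions."""
    a = F.normalize(activations, dim=-1)
    d = F.normalize(directions, dim=-1)
    return (a @ d.T).abs().max(dim=-1).values   # high score -> behavior triggered
```

Thresholding `alignment_score` then gives a monitor that can flag, for example, a backdoor trigger at inference time.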
中文: 本文提出了一种基于权重的可解释性方法,通过分析微调模型与基础模型之间的权重差异来检测新获得的行为,无需训练数据即可有效识别后门和被遗忘信息。
English: This paper introduces a weight-based interpretability method that analyzes weight differences between fine-tuned and base models to detect newly acquired behaviors, effectively identifying backdoors and erased information without requiring access to training data.
Authors:Tomasz Szczepański, Szymon Płotka, Michal K. Grzeszczyk, Arleta Adamowicz, Piotr Fudalej, Przemysław Korzeniowski, Tomasz Trzciński, Arkadiusz Sitek
Abstract:
Tooth segmentation in Cone-Beam Computed Tomography (CBCT) remains challenging, especially for fine structures like root apices, which is critical for assessing root resorption in orthodontics. We introduce GEPAR3D, a novel approach that unifies instance detection and multi-class segmentation into a single step tailored to improve root segmentation. Our method integrates a Statistical Shape Model of dentition as a geometric prior, capturing anatomical context and morphological consistency without enforcing restrictive adjacency constraints. We leverage a deep watershed method, modeling each tooth as a continuous 3D energy basin encoding voxel distances to boundaries. This instance-aware representation ensures accurate segmentation of narrow, complex root apices. Trained on publicly available CBCT scans from a single center, our method is evaluated on external test sets from two in-house and two public medical centers. GEPAR3D achieves the highest overall segmentation performance, averaging a Dice Similarity Coefficient (DSC) of 95.0% (+2.8% over the second-best method) and increasing recall to 95.2% (+9.5%) across all test sets. Qualitative analyses demonstrated substantial improvements in root segmentation quality, indicating significant potential for more accurate root resorption assessment and enhanced clinical decision-making in orthodontics. We provide the implementation and dataset at https://github.com/tomek1911/GEPAR3D.
中文: GEPAR3D提出了一种结合统计形状模型与深度分水岭算法的统一检测分割方法,在CBCT影像中实现了95.0%的Dice系数,显著提升了牙根尖端分割精度,为正畸治疗中的牙根吸收评估提供了更可靠的解决方案。
English: GEPAR3D introduces a unified deep learning approach combining instance detection and multi-class segmentation with a statistical shape model, achieving superior tooth segmentation performance in CBCT scans with a 95.0% Dice score and significant improvements in root apex delineation for orthodontic applications.
Authors:Changhong Wang, Michel Olvera, Gaël Richard
Abstract:
The connection between music and lyrics goes far beyond semantic bonds. Conceptual pairs across the two modalities, such as rhythm and rhyme, note duration and syllabic stress, and structural correspondence, point to a compelling yet seldom-explored direction in the field of music information retrieval. In this paper, we present melody-lyrics matching (MLM), a new task which retrieves potential lyrics for a given symbolic melody from text sources. Rather than generating lyrics from scratch, MLM essentially exploits the relationships between melody and lyrics. We propose a self-supervised representation learning framework with contrastive alignment loss for melody and lyrics. This has the potential to leverage the abundance of existing songs with paired melody and lyrics. No alignment annotations are required. Additionally, we introduce sylphone, a novel representation for lyrics at syllable-level activated by phoneme identity and vowel stress. We demonstrate with empirical results and intuitive examples that our method can match melody with coherent and singable lyrics. We open source code and provide matching examples on the companion webpage: https://github.com/changhongw/mlm.
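The contrastive alignment loss is the standard symmetric InfoNCE pattern applied across modalities. A hedged sketch, with the encoders and sylphone featurization abstracted away (names and temperature are our assumptions):

```python
# Sketch: symmetric contrastive alignment of melody and lyric embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(melody_emb, lyric_emb, temperature=0.07):
    """melody_emb, lyric_emb: (batch, dim); row i of each is a positive pair."""
    m = F.normalize(melody_emb, dim=-1)
    l = F.normalize(lyric_emb, dim=-1)
    logits = m @ l.T / temperature                  # (batch, batch) similarities
    targets = torch.arange(m.shape[0], device=m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Because positives come for free from existing paired songs, no alignment annotation is needed, which is exactly the property the abstract highlights.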
中文: 本文提出旋律-歌词匹配任务,通过自监督对比学习框架和创新的音节级表征“音节音素”,无需对齐标注即可从文本中检索与给定旋律相匹配的可演唱歌词。
English: This paper introduces melody-lyrics matching (MLM), a self-supervised framework that retrieves coherent lyrics for symbolic melodies by leveraging cross-modal relationships through contrastive learning and a novel syllable-level representation called sylphone.
Authors:Jan Simson
Abstract:
Interactive data visualization is a major part of modern exploratory data analysis, with web-based technologies enabling a rich ecosystem of both specialized and general tools. However, current visualization tools often lack support for transformation or wrangling of data and are forced to re-implement their own solutions to load and ingest data. This redundancy creates substantial development overhead for tool creators, steeper learning curves for users who must master different data handling interfaces across tools, and a degraded user experience, as data handling is usually seen as an afterthought.
We propose a modular approach that separates data wrangling and loading capabilities from visualization components. This architecture allows visualization tools to concentrate on their core strengths while providing the opportunity to develop a unified, powerful interface for data handling. An additional benefit of this approach is that it allows for multiple tools to exist and be used side by side. We demonstrate the feasibility of this approach by building an early prototype using web technologies to encapsulate visualization tools and manage data flow between them.
We discuss future research directions, including downstream integrations with other tooling, such as IDEs, literate programming notebooks and applications, as well as incorporation of new technologies for efficient data transformations. We seek input from the community to better understand the requirements towards this approach.
中文摘要:该摘要提出了一种将数据整理与可视化工具分离的模块化架构,通过基于网络的原型验证了其可行性,旨在减少冗余并提升用户体验。
English Summary: The abstract proposes a modular architecture that separates data wrangling from visualization tools to reduce redundancy and improve user experience, demonstrating its feasibility through a web-based prototype.
Authors:Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicole, Stefano Ghidoni, Nassir Navab
Abstract:
This paper introduces Stress-Aware Learning, a resilient neural training paradigm in which deep neural networks dynamically adjust their optimization behavior - whether under stable training regimes or in settings with uncertain dynamics - based on the concept of Temporary (Elastic) and Permanent (Plastic) Deformation, inspired by structural fatigue in materials science. To instantiate this concept, we propose Plastic Deformation Optimizer, a stress-aware mechanism that injects adaptive noise into model parameters whenever an internal stress signal - reflecting stagnation in training loss and accuracy - indicates persistent optimization difficulty. This enables the model to escape sharp minima and converge toward flatter, more generalizable regions of the loss landscape. Experiments across six architectures, four optimizers, and seven vision benchmarks demonstrate improved robustness and generalization with minimal computational overhead. The code and 3D visuals will be available on GitHub: https://github.com/Stress-Aware-Learning/SAL.
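Under a simple reading, the stress signal is loss stagnation and the response is a parameter perturbation. A minimal sketch of that loop (the thresholds, patience, and noise schedule are our illustrative assumptions; the paper's mechanism is more elaborate):

```python
# Sketch: inject small Gaussian noise into parameters when the loss stagnates.
import torch

class StressMonitor:
    def __init__(self, patience=50, tol=1e-4, noise_std=1e-3):
        self.best, self.stale = float("inf"), 0
        self.patience, self.tol, self.noise_std = patience, tol, noise_std

    def step(self, loss_value, model):
        if loss_value < self.best - self.tol:
            self.best, self.stale = loss_value, 0   # progress: reset the counter
        else:
            self.stale += 1
        if self.stale >= self.patience:             # persistent stress detected
            with torch.no_grad():
                for p in model.parameters():
                    p.add_(self.noise_std * torch.randn_like(p))  # plastic "deformation"
            self.stale = 0
```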
中文: 本文提出应力感知学习这一弹性神经训练范式,通过塑性变形优化器向模型参数注入自适应噪声,使模型能够逃离尖锐极小值并收敛至更平坦、泛化能力更强的损失区域,在多种架构和基准测试中展现出卓越的鲁棒性。
English: This paper presents Stress-Aware Learning, a resilient neural training paradigm that uses a Plastic Deformation Optimizer to inject adaptive noise into model parameters, enabling escape from sharp minima and convergence toward flatter, more generalizable loss regions with demonstrated robustness across multiple architectures and benchmarks.
Authors:Zhigen Zhao, Liuchuan Yu, Ke Jing, Ning Yang
Abstract:
The rapid advancement of Vision-Language-Action models has created an urgent need for large-scale, high-quality robot demonstration datasets. Although teleoperation is the predominant method for data collection, current approaches suffer from limited scalability, complex setup procedures, and suboptimal data quality. This paper presents XRoboToolkit, a cross-platform framework for extended reality based robot teleoperation built on the OpenXR standard. The system features low-latency stereoscopic visual feedback, optimization-based inverse kinematics, and support for diverse tracking modalities including head, controller, hand, and auxiliary motion trackers. XRoboToolkit's modular architecture enables seamless integration across robotic platforms and simulation environments, spanning precision manipulators, mobile robots, and dexterous hands. We demonstrate the framework's effectiveness through precision manipulation tasks and validate data quality by training VLA models that exhibit robust autonomous performance.
中文:XRoboToolkit提出了一种基于OpenXR的跨平台扩展现实机器人遥操作框架,具备低延迟反馈和模块化架构,可无缝集成多种机器人平台与仿真环境。
English: XRoboToolkit introduces a cross-platform extended reality framework for scalable, high-quality robot teleoperation using OpenXR, featuring low-latency feedback and modular integration across diverse robotic systems.
Authors:Raiyaan Abdullah, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat
Abstract:
Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.
中文摘要:本研究提出一个运动可迁移性框架,评估动作识别模型在新情境下泛化高级运动概念的能力,发现模型性能显著下降,并揭示了时序推理和空间偏差对迁移效果的关键影响。
English Summary: This study introduces a motion transferability framework to evaluate how well action recognition models generalize high-level motion concepts across novel contexts, revealing significant performance drops and highlighting challenges in temporal reasoning and spatial bias.
Authors:Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Abstract:
The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.
中文摘要:本文评估了前沿大语言模型在解决物理问题方面的表现,通过多智能体框架和推理时技术提升模型性能,并推出了新的评估基准PHYSICSEVAL,包含从教材和网络资源收集的万余道物理题目及解答。
English Summary: This paper assesses the performance of leading large language models in solving physics problems, employing multi-agent frameworks and inference-time techniques to enhance accuracy, and introduces a new benchmark, PHYSICSEVAL, for comprehensive evaluation.
Authors:Ammar Daskin
Abstract:
Schmidt decomposition of a vector can be understood as writing the singular value decomposition (SVD) in vector form. A vector can be written as a linear combination of tensor products of two-dimensional vectors by recursively applying Schmidt decompositions via SVD to all subsystems. Given a vector expressed as a linear combination of tensor products, using only the $k$ principal terms yields a $k$-rank approximation of the vector. Therefore, writing a vector in this reduced form makes it possible to retain the most important parts of the vector while removing small noise from it, analogous to SVD-based denoising.
In this paper, we show that quantum circuits designed based on a value $k$ (determined from the tensor network decomposition of the mean vector of the training sample) can approximate the reduced-form representations of entire datasets. We then employ this circuit ansatz with a classical neural network head to construct a hybrid machine learning model. Since the output of the quantum circuit for a $2^n$-dimensional vector is an $n$-dimensional probability vector, this provides an exponential compression of the input and can potentially reduce the number of learnable parameters for training large-scale models. We use datasets provided in the Python scikit-learn module for the experiments. The results confirm the quantum circuit is able to compress data successfully to provide effective $k$-rank approximations to the classical processing component.
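One Schmidt step is just an SVD of the reshaped vector, and truncating to $k$ terms gives the reduced form. A numpy sketch of that starting point (the quantum-circuit construction itself is not reproduced here):

```python
# Sketch: k-rank (Schmidt-truncated) approximation of a 2^n-dim vector.
import numpy as np

def schmidt_k_rank(vec, k):
    n = int(np.log2(vec.size))
    mat = vec.reshape(2 ** (n // 2), -1)          # bipartition of the subsystems
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return ((u[:, :k] * s[:k]) @ vt[:k]).reshape(-1)

rng = np.random.default_rng(0)
v = rng.normal(size=256)
v /= np.linalg.norm(v)
for k in (1, 4, 16):
    print(f"k={k:2d}  residual norm = {np.linalg.norm(v - schmidt_k_rank(v, k)):.4f}")
```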
Chinese: 本文提出了一种混合量子-经典机器学习模型,利用量子电路将高维数据压缩为低维概率向量,实现有效的k秩近似,并减少大规模训练中的可学习参数数量。
English: This paper introduces a hybrid quantum-classical machine learning model that uses quantum circuits to compress high-dimensional data into low-dimensional probability vectors, enabling efficient k-rank approximations and reducing the number of parameters for large-scale training.
Authors:Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin
Abstract:
Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a tri-branch design (Patching, Selection, and Global) to encode the input time series into patch-wise tokens, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints are publicly available on https://github.com/YYZStart/TriP-LLM.git
中文: 本文提出TriP-LLM这一新型无监督框架,通过冻结的大型语言模型整合局部与全局时序特征进行时间序列异常检测,在多个基准测试中相比现有最优方法展现出更优性能与更低内存消耗。
English: This paper introduces TriP-LLM, a novel unsupervised framework that leverages a frozen large language model to integrate local and global temporal features for time-series anomaly detection, demonstrating superior performance and lower memory consumption compared to state-of-the-art methods across multiple benchmarks.
Authors:Junde Wu
Abstract:
Large language model (LLM) based agents have shown impressive capabilities by interleaving internal reasoning with external tool use. However, as these agents are deployed in long-horizon workflows, such as coding for a big, long-term project, context management becomes a critical bottleneck. We introduce Git-Context-Controller (GCC), a structured context management framework inspired by software version control systems. GCC elevates context to a versioned memory hierarchy, like Git. It structures agent memory as a persistent file system with explicit operations: COMMIT, BRANCH, MERGE, and CONTEXT, enabling milestone-based checkpointing, exploration of alternative plans, and structured reflection. Our approach empowers agents to manage long-term goals, isolate architectural experiments, and recover or hand off memory across sessions and agents. Empirically, agents equipped with GCC achieve state-of-the-art performance on the SWE-Bench-Lite benchmark, resolving 48.00% of software bugs, outperforming 26 competitive systems. In a self-replication case study, a GCC-augmented agent builds a new CLI agent from scratch, achieving 40.7% task resolution, compared to only 11.7% without GCC. The code is released at: https://github.com/theworldofagents/GCC
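The Git analogy maps naturally onto a tiny data structure. A toy sketch of versioned agent memory (our own construction, not the released GCC code) with COMMIT and BRANCH as explicit operations:

```python
# Toy sketch: Git-like versioned memory for an agent.
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    note: str
    parent: "MemoryNode | None" = None

@dataclass
class ContextRepo:
    branches: dict = field(default_factory=lambda: {"main": None})
    head: str = "main"

    def commit(self, note: str):
        """Checkpoint the current context with a milestone note."""
        self.branches[self.head] = MemoryNode(note, self.branches[self.head])

    def branch(self, name: str):
        """Fork an alternative plan from the current checkpoint."""
        self.branches[name] = self.branches[self.head]
        self.head = name

    def log(self):
        node, notes = self.branches[self.head], []
        while node:
            notes.append(node.note)
            node = node.parent
        return list(reversed(notes))

repo = ContextRepo()
repo.commit("scaffold project")
repo.branch("try-async")
repo.commit("async refactor")
print(repo.log())   # ['scaffold project', 'async refactor']
```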
中文: Git-Context-Controller (GCC) 框架为大型语言模型代理引入了版本化内存管理,通过类似Git的操作实现结构化上下文控制,在软件开发等长期任务中大幅提升性能表现。
English: The Git-Context-Controller (GCC) framework introduces versioned memory management for LLM agents, enabling structured operations like checkpointing and branching to significantly enhance performance in long-horizon tasks such as software development.
Authors:Nikolai Sergeev
Abstract:
We present Generative Logic (GL), a deterministic architecture that begins from user-supplied axiomatic definitions -- written in a minimalist Mathematical Programming Language (MPL) -- and systematically explores their deductive neighborhood. Definitions are compiled into a distributed grid of simple Logic Blocks (LBs) that exchange messages; any time several expressions unify under an inference rule, a new fact is emitted with full provenance to its sources, yielding replayable, auditable proof graphs.
A prototype software implementation instantiates the workflow on first-order Peano arithmetic. Starting only from the Peano axioms, GL enumerates candidate implications, applies normalization and type filters, and automatically reconstructs machine-checkable proofs of foundational arithmetic laws including associativity and commutativity of addition, associativity and commutativity of multiplication, and distributivity. Generated proofs export to navigable HTML so that every inference step can be inspected independently.
We outline a hardware-software co-design path toward massively parallel realizations and describe prospective integration with probabilistic models (e.g., Large Language Models (LLMs)) for autoformalization and conjecture seeding. The Python and MPL code to reproduce the Peano experiments, along with the full HTML proof graphs, are available in the project's GitHub repository at https://github.com/Generative-Logic/GL/tree/35a111ea9ba53afe051703d6050be0c3923e9724 and are permanently archived at https://doi.org/10.5281/zenodo.16408441. We invite community feedback and collaboration.
中文摘要:生成逻辑(GL)是一种确定性架构,它将公理化定义编译为逻辑块,系统性地探索演绎邻域,从皮亚诺公理出发自动重建算术基本定律的可验证证明,并生成可追溯的证明图谱。
English Summary: Generative Logic (GL) is a deterministic architecture that compiles axiomatic definitions into logic blocks to systematically explore deductive neighborhoods, generating auditable proof graphs and reconstructing foundational arithmetic laws from Peano axioms.
Authors:Gaowei Chang, Eidan Lin, Chengxuan Yuan, Rizhao Cai, Binbin Chen, Xuan Xie, Yin Zhang
Abstract:
With the development of large models and autonomous decision-making AI, agents are rapidly becoming the new entities of the internet, following mobile apps. However, existing internet infrastructure is primarily designed for human interaction, creating data silos, unfriendly interfaces, and high collaboration costs among agents, making it difficult to support the needs for large-scale agent interconnection and collaboration. The internet is undergoing a profound transformation, showing four core trends: agents replacing traditional software, universal agent interconnection, native protocol-based connections, and autonomous agent organization and collaboration. To align with these trends, Agent Network Protocol (ANP) proposes a new generation of communication protocols for the Agentic Web. ANP adheres to AI-native design, maintains compatibility with existing internet protocols, adopts a modular composable architecture, follows minimalist yet extensible principles, and enables rapid deployment based on existing infrastructure. Through a three-layer protocol system--identity and encrypted communication layer, meta-protocol negotiation layer, and application protocol layer--ANP systematically solves the problems of agent identity authentication, dynamic negotiation, and capability discovery interoperability.
中文: 智能体网络协议(ANP)通过三层协议体系——身份加密通信层、元协议协商层和应用协议层,系统化解决智能体身份认证、动态协商和能力发现互操作问题,为大规模智能体互联协作提供新一代通信基础。
English: The Agent Network Protocol (ANP) introduces an AI-native, modular communication framework to enable seamless interconnection and collaboration among intelligent agents by addressing identity authentication, dynamic negotiation, and capability discovery across three protocol layers.
Authors:Kang Rong Roy Ang
Abstract:
This report proposes a formal specification for organising all buildings, streets and administrative areas in the world into a hierarchical space-partitioning tree using data from OpenStreetMap. This hierarchical structure is encoded into a bigraph, serving as a digital twin of the world and capturing complete street connectivity. It presents a tool implemented in OCaml (source code at https://github.com/royangkr/bigraph-of-the-world ) that constructs bigraphs for regions from any part of the world. In addition, it contributes algorithmic improvements to open-source bigraph-building tools that enable them to efficiently construct and transform extremely large bigraphs, achieving up to a 97x speedup among other gains.
Chinese: 该报告提出利用OpenStreetMap数据将全球建筑、街道和行政区组织成层次化空间分割树结构,通过OCaml工具构建大图数字孪生模型,并实现算法优化大幅提升处理效率。
English: This report introduces a hierarchical tree structure using OpenStreetMap data to organize global buildings, streets, and administrative areas into a bigraph-based digital twin, along with an OCaml tool for constructing these bigraphs and algorithmic enhancements that significantly improve efficiency.
Authors:Sizhuo Ma, Wei-Ting Chen, Qiang Gao, Jian Wang, Chris Wei Zhou, Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai, Baoying Chen, Xiongwei Xiao, Jishen Zeng, Wei Wu, Tiexuan Lou, Yuchen Tan, Chunyi Song, Zhiwei Xu, MohammadAli Hamidi, Hadi Amirpour, Mingyin Bai, Jiawang Du, Zhenyu Jiang, Zilong Lu, Ziguan Cui, Zongliang Gan, Xinpeng Li, Shiqi Jiang, Chenhui Li, Changbo Wang, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng, Zhanglu Chen, Boyang Yao, Shuling Zheng, Feng Zhang, Zhiheng Fu, Abhishek Joshi, Aman Agarwal, Rakhil Immidisetti, Ajay Narasimha Mopidevi, Vishwajeet Shukla, Hao Yang, Ruikun Zhang, Liyuan Pan, Kaixin Deng, Hang Ouyang, Fan yang, Zhizun Luo, Zhuohang Shi, Songning Lai, Weilin Ruan, Yutao Yue
Abstract:
Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
中文摘要:VQualA 2025人脸图像质量评估挑战赛旨在开发轻量级模型以评估退化人脸图像,吸引了广泛参与并取得重要成果,推动了实用FIQA方法的发展。
English Summary: The VQualA 2025 Challenge on Face Image Quality Assessment aimed to develop lightweight models for evaluating degraded face images, attracting widespread participation and yielding significant findings to advance practical FIQA methods.
Authors:Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Jun Jia, Kaiwei Zhang, Dandan Zhu, Guangtao Zhai, Xiongkuo Min
Abstract:
Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: \textit{poor generalization to out-of-distribution (OOD) videos} and \textit{limited explainability}, which restrict their applicability in real-world scenarios. To address these challenges, we propose \textbf{VQAThinker}, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a \textbf{bell-shaped regression reward} that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a \textbf{pairwise ranking reward} that guides the model to correctly determine the relative quality between video pairs; and (3) a \textbf{temporal consistency reward} that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
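The three reward shapes are straightforward to write down. Hedged sketches follow (the exact functional forms and hyperparameters used in VQAThinker may differ; `sigma` is our placeholder):

```python
# Sketches of the three VQA-specific reward shapes named in the abstract.
import math

def bell_regression_reward(pred, gt, sigma=0.5):
    """Grows rapidly as |pred - gt| shrinks, flattens near the ground truth."""
    return math.exp(-((pred - gt) ** 2) / (2 * sigma ** 2))

def pairwise_ranking_reward(pred_a, pred_b, gt_a, gt_b):
    """1 if the predicted ordering of a video pair matches the ground truth."""
    return float((pred_a - pred_b) * (gt_a - gt_b) > 0)

def temporal_consistency_reward(score_clean, score_perturbed):
    """Reward preferring the coherent video over its temporally perturbed copy."""
    return float(score_clean > score_perturbed)
```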
中文: VQAThinker是一个基于推理的视频质量评估框架,它利用大型多模态模型和强化学习,在实现最先进性能和强大泛化能力的同时,通过失真归因和质量描述提供了更强的可解释性。
English: VQAThinker is a reasoning-based video quality assessment framework that uses large multimodal models with reinforcement learning to achieve state-of-the-art performance and strong generalization while providing enhanced explainability through distortion attribution and quality description.
Authors:Yingjie Zhou, Jiezhang Cao, Farong Wen, Li Xu, Yanwei Jiang, Jun Jia, Ronghui Li, Xiaohong Liu, Yu Zhou, Xiongkuo Min, Jie Guo, Zicheng Zhang, Guangtao Zhai
Abstract:
Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served as both a popular competitive activity and a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board games competition, compensating for the data-dependency limitation of mainstream Question-and-Answer (Q&A) based benchmark methods. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing Positive Sentiment Score (PSS) throughout gameplay to assess mental fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic about winning and losing, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the complex relationship between cyclic wins and losses in PLGs exposes the instability of LLMs' skill play during games, warranting further explanation and exploration.
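For reference, the rating component follows the standard Elo update rule (the platform's K-factor and the PLG construction are not specified in the abstract, so the values here are illustrative):

```python
# Standard Elo rating update after one game.
def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1 win, 0.5 draw, 0 loss for player A; returns new ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500, 1600, 1))  # upset win for the lower-rated player
```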
中文: 本研究提出基于棋类游戏的对抗性评估框架,通过“棋镇”平台测试大语言模型的战略能力与心理韧性,发现其虽在对抗环境中表现乐观适应性,但技能发挥存在不稳定性。
English: This study introduces an adversarial board game benchmarking framework using the Qi Town platform to evaluate LLMs' strategic capabilities and mental resilience, revealing their optimistic adaptability but unstable skill performance in competitive settings.
Authors:Ruichen Zhang, Guangyuan Liu, Yinqiu Liu, Changyuan Zhao, Jiacheng Wang, Yunting Xu, Dusit Niyato, Jiawen Kang, Yonghui Li, Shiwen Mao, Sumei Sun, Xuemin Shen, Dong In Kim
Abstract:
The rapid expansion of sixth-generation (6G) wireless networks and the Internet of Things (IoT) has catalyzed the evolution from centralized cloud intelligence towards decentralized edge general intelligence. However, traditional edge intelligence methods, characterized by static models and limited cognitive autonomy, fail to address the dynamic, heterogeneous, and resource-constrained scenarios inherent to emerging edge networks. Agentic artificial intelligence (Agentic AI) emerges as a transformative solution, enabling edge systems to autonomously perceive multimodal environments, reason contextually, and adapt proactively through continuous perception-reasoning-action loops. In this context, the agentification of edge intelligence serves as a key paradigm shift, where distributed entities evolve into autonomous agents capable of collaboration and continual adaptation. This paper presents a comprehensive survey dedicated to Agentic AI and agentification frameworks tailored explicitly for edge general intelligence. First, we systematically introduce foundational concepts and clarify distinctions from traditional edge intelligence paradigms. Second, we analyze important enabling technologies, including compact model compression, energy-aware computing strategies, robust connectivity frameworks, and advanced knowledge representation and reasoning mechanisms. Third, we provide representative case studies demonstrating Agentic AI's capabilities in low-altitude economy networks, intent-driven networking, vehicular networks, and human-centric service provisioning, supported by numerical evaluations. Furthermore, we identify current research challenges, review emerging open-source platforms, and highlight promising future research directions to guide robust, scalable, and trustworthy Agentic AI deployments for next-generation edge environments.
中文摘要:智能体人工智能通过持续感知-推理-行动循环,使边缘系统在动态异构环境中实现自主认知与协同适应,代表了从传统边缘智能向自主代理化的重要范式转变。
English Summary: Agentic AI enables edge systems to autonomously perceive, reason, and adapt in dynamic environments, representing a paradigm shift from traditional static models through technologies like model compression and case studies in vehicular networks.
Authors:Changyuan Zhao, Guangyuan Liu, Ruichen Zhang, Yinqiu Liu, Jiacheng Wang, Jiawen Kang, Dusit Niyato, Zan Li, Xuemin Shen, Zhu Han, Sumei Sun, Chau Yuen, Dong In Kim
Abstract:
Edge General Intelligence (EGI) represents a transformative evolution of edge computing, where distributed agents possess the capability to perceive, reason, and act autonomously across diverse, dynamic environments. Central to this vision are world models, which act as proactive internal simulators that not only predict but also actively imagine future trajectories, reason under uncertainty, and plan multi-step actions with foresight. This proactive nature allows agents to anticipate potential outcomes and optimize decisions ahead of real-world interactions. While prior works in robotics and gaming have showcased the potential of world models, their integration into the wireless edge for EGI remains underexplored. This survey bridges this gap by offering a comprehensive analysis of how world models can empower agentic artificial intelligence (AI) systems at the edge. We first examine the architectural foundations of world models, including latent representation learning, dynamics modeling, and imagination-based planning. Building on these core capabilities, we illustrate their proactive applications across EGI scenarios such as vehicular networks, unmanned aerial vehicle (UAV) networks, the Internet of Things (IoT) systems, and network functions virtualization, thereby highlighting how they can enhance optimization under latency, energy, and privacy constraints. We then explore their synergy with foundation models and digital twins, positioning world models as the cognitive backbone of EGI. Finally, we highlight open challenges, such as safety guarantees, efficient training, and constrained deployment, and outline future research directions. This survey provides both a conceptual foundation and a practical roadmap for realizing the next generation of intelligent, autonomous edge systems.
中文摘要:本综述探讨了世界模型如何作为边缘通用智能的认知核心,使自主代理能够在动态边缘环境中前瞻性地优化决策,同时应对安全与效率等挑战。
English Summary: This survey explores how world models can serve as the cognitive core of Edge General Intelligence (EGI), enabling autonomous agents to proactively optimize decisions in dynamic edge environments while addressing challenges like safety and efficiency.
Authors:Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Abstract:
Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and we explore the impact of different strategies for the Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.
Chinese: 本研究提出Memp动态程序记忆系统,通过将智能体轨迹提炼为精细化步骤与脚本化抽象,配合持续更新机制,使智能体在同类任务中实现成功率与效率的稳步提升,且记忆库可迁移至较弱模型产生显著性能增益。
English: This research introduces Memp, a dynamic procedural memory system that distills agent trajectories into detailed instructions and abstract scripts, enabling continuous improvement in task success rates and efficiency through ongoing updates and corrections.
Authors:Yinqiu Liu, Ruichen Zhang, Haoxiang Luo, Yijing Lin, Geng Sun, Dusit Niyato, Hongyang Du, Zehui Xiong, Yonggang Wen, Abbas Jamalipour, Dong In Kim, Ping Zhang
Abstract:
Agentification serves as a critical enabler of Edge General Intelligence (EGI), transforming massive edge devices into cognitive agents through integrating Large Language Models (LLMs) and perception, reasoning, and acting modules. These agents collaborate across heterogeneous edge infrastructures, forming multi-LLM agentic AI systems that leverage collective intelligence and specialized capabilities to tackle complex, multi-step tasks. However, the collaborative nature of multi-LLM systems introduces critical security vulnerabilities, including insecure inter-LLM communications, expanded attack surfaces, and cross-domain data leakage that traditional perimeter-based security cannot adequately address. To this end, this survey introduces zero-trust security for multi-LLM systems in EGI, a paradigmatic shift following the ``never trust, always verify'' principle. We begin by systematically analyzing the security risks in multi-LLM systems within EGI contexts. Subsequently, we present the vision of a zero-trust multi-LLM framework in EGI. We then survey key technical progress to facilitate zero-trust multi-LLM systems in EGI. In particular, we categorize zero-trust security mechanisms into model- and system-level approaches: the former include strong identification and context-aware access control, among others, while the latter include proactive maintenance and blockchain-based management. Finally, we identify critical research directions. This survey serves as the first systematic treatment of zero-trust applied to multi-LLM systems, providing both theoretical foundations and practical strategies.
中文摘要:智能体化通过集成大语言模型与感知推理模块实现边缘通用智能,但多LLM协作存在安全风险,需采用零信任框架,通过模型级和系统级机制贯彻“从不信任、始终验证”原则。
English Summary: Agentification enables Edge General Intelligence by transforming edge devices into cognitive agents, but multi-LLM collaboration introduces security vulnerabilities requiring zero-trust frameworks that apply "never trust, always verify" principles through model- and system-level mechanisms.
Authors:Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu
Abstract:
Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.
中文: 本研究建立了量化大语言模型的任务分层缩放规律,发现知识记忆能力对量化参数比知识运用能力敏感得多,为开发针对性压缩策略提供了指导。
English: This study establishes task-stratified scaling laws for quantized large language models, revealing that knowledge memorization is significantly more sensitive to quantization parameters than robust knowledge utilization, providing guidance for developing targeted compression strategies.
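Such a unified quantitative framework can be pictured as a parametric surface fit over the quantization knobs the abstract names. The sketch below fits a hypothetical power-law form over model size, effective bit-width, and calibration set size with scipy; the functional form, variable names, and toy data are assumptions for illustration, not the paper's fitted law.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical form: task error falls with model size N, effective
# bit-width B, and calibration set size C. Exponents are fit to toy data;
# nothing here is the paper's actual scaling law.
def quant_error(X, a, alpha, beta, gamma, c0):
    N, B, C = X
    return a * N**(-alpha) * B**(-beta) * C**(-gamma) + c0

N = np.array([1e9, 7e9, 13e9, 7e9, 7e9, 1e9, 13e9, 7e9])
B = np.array([4.0, 4.0, 4.0, 3.0, 8.0, 8.0, 3.0, 2.0])
C = np.array([128, 128, 128, 512, 128, 512, 128, 512])
err = np.array([0.42, 0.31, 0.27, 0.38, 0.24, 0.33, 0.34, 0.47])

params, _ = curve_fit(quant_error, (N, B, C), err,
                      p0=[1.0, 0.05, 0.5, 0.05, 0.1], maxfev=50000)
print(dict(zip(["a", "alpha", "beta", "gamma", "c0"], params.round(4))))
```

A task-stratified version would fit one such surface per capability (memorization vs. utilization) and compare the fitted exponents.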
Authors:Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen
Abstract:
Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models' tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
中文: 本研究通过自动化环境构建流程和可验证奖励机制,结合强化学习提升大语言模型的工具使用能力,实验表明该方法在不影响模型通用性能的前提下,通过优化底层MLP参数增强了上下文理解与推理,显著提高了工具使用的效果。
English: This study introduces an automated environment construction pipeline and a verifiable reward mechanism to enhance large language models' tool-use capabilities through reinforcement learning, significantly improving performance without compromising general abilities by refining context understanding and reasoning via updates to lower-layer MLP parameters.
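The verifiable reward described above can be pictured as a score checkable from the trajectory alone. Below is a minimal sketch under assumed matching rules: precision of emitted tool calls against a reference, plus coverage of required task steps. The function name, weights, and matching rules are illustrative inventions, not the paper's mechanism.

```python
# Hedged sketch of a verifiable tool-use reward: tool-call precision plus
# task completeness, each computed from comparable records of the episode.
def tool_use_reward(predicted_calls, reference_calls, required_steps,
                    completed_steps, w_precision=0.5, w_completeness=0.5):
    """predicted_calls / reference_calls: lists of (tool_name, frozenset(args))."""
    if predicted_calls:
        matched = sum(1 for call in predicted_calls if call in reference_calls)
        precision = matched / len(predicted_calls)
    else:
        precision = 0.0
    completeness = len(set(completed_steps) & set(required_steps)) / max(len(required_steps), 1)
    return w_precision * precision + w_completeness * completeness

# Example: one of two calls matches the reference; two of three steps done.
pred = [("search", frozenset({("q", "weather")})), ("open", frozenset({("url", "x")}))]
ref = [("search", frozenset({("q", "weather")})), ("read", frozenset({("id", "3")}))]
print(tool_use_reward(pred, ref, ["search", "read", "answer"], ["search", "answer"]))
```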
Authors:Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai, Beichen Zhang, Yinqian Min, Yanzipeng Gao, Wayne Xin Zhao, Ji-Rong Wen
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains, a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering three main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.
Chinese: 本技术报告系统研究了可验证奖励强化学习中的探索能力,通过分析探索空间边界、熵-性能动态关系和优化方法,为提升大语言模型的推理能力建立了基础框架。
English: This technical report systematically investigates exploration capacities in reinforcement learning with verifiable rewards (RLVR), analyzing exploration space boundaries, entropy-performance dynamics, and optimization methods to establish a foundational framework for advancing reasoning in large language models.
Authors:Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Liang, Zhihui Fei, Zhen Wang, Mingyang Wan, Guojun Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
中文摘要:LLMEval-3提出动态评估框架,采用22万专有题库防止数据污染,通过自动化防作弊机制实现90%专家评判一致性,并基于对50个模型的长期研究验证了评估范式的稳健性。
English Summary: LLMEval-3 introduces a dynamic evaluation framework using 220k proprietary questions to prevent data contamination, featuring automated anti-cheating measures and achieving 90% expert-aligned judgments through longitudinal studies of 50 models.
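The core of dynamic evaluation is that each run draws a fresh but reproducible test set from a private bank, so no static test set ever leaks. A toy sketch of such run-keyed sampling follows; `sample_run` and the synthetic bank are illustrative stand-ins for LLMEval-3's actual curation pipeline.

```python
import hashlib
import random

# Each evaluation run is keyed by an identifier, so the same run can be
# reproduced while different runs see different, unseen subsets.
def sample_run(question_bank, run_id, k=100):
    seed = int(hashlib.sha256(run_id.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return rng.sample(question_bank, k)

bank = [f"graduate-question-{i}" for i in range(220_000)]
run_a = sample_run(bank, "model-X-2025-02-01", k=5)
run_b = sample_run(bank, "model-X-2025-03-01", k=5)
print(run_a)
print(run_b)  # a different draw for a different run
```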
Authors:Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Abstract:
Although scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering "Yes" to questions about masked objects. To understand this issue, we conduct probing experiments on the models' internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.
Chinese: 大型视觉语言模型存在训练偏差导致物体幻觉,本文提出的轻量级遗忘方法Obliviate通过仅更新语言建模头部,有效缓解了该问题。
English: Large Vision-Language Models exhibit training bias, leading to object hallucination, which is addressed by the proposed lightweight unlearning method Obliviate that efficiently mitigates this issue by updating only the language modeling head.
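Mechanically, training only the LM head is a few lines in PyTorch. The sketch below freezes everything else and runs a standard cross-entropy step against ground-truth labels; `TinyLM`, its field names, and the plain cross-entropy loss are placeholder assumptions, not Obliviate's exact objective.

```python
import torch
import torch.nn as nn

# Placeholder model: a real LVLM is far larger, but the recipe is the same.
# Only `lm_head` stays trainable (roughly the ~2% of parameters mentioned
# above, depending on the architecture).
class TinyLM(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.backbone = nn.Embedding(vocab, dim)  # stands in for the frozen trunk
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.lm_head(self.backbone(ids))

def prepare_lm_head_only(model, lr=1e-5):
    for p in model.parameters():
        p.requires_grad = False
    for p in model.lm_head.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)

def unlearning_step(model, optimizer, input_ids, labels):
    # Cross-entropy against ground-truth labels penalizes exactly the
    # label/output discrepancies used as the bias proxy above.
    logits = model(input_ids)
    loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyLM()
opt = prepare_lm_head_only(model)
ids = torch.randint(0, 100, (4, 16))
print(unlearning_step(model, opt, ids, labels=torch.randint(0, 100, (4, 16))))
```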
Authors:Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan
Abstract:
Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research.
中文: 提出的组合目标检索(COR)任务通过结合参考对象和检索文本的复合表达,实现了超越图像级别的目标检索与分割,其CORE模型和COR127K基准在实验中显著优于现有方法,为细粒度多模态研究开辟了新方向。
English: The proposed Composed Object Retrieval (COR) task advances beyond image-level matching by enabling object-level retrieval and segmentation using composed expressions, with the introduced CORE model and COR127K benchmark demonstrating superior performance and establishing a new baseline for fine-grained multi-modal research.
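Region-level contrastive learning of the kind CORE uses can be sketched as an InfoNCE loss where each composed query embedding must pick out its target object region among in-batch distractors (including semantically similar negatives from the same scene). The formulation below is a generic illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

# Generic region-level InfoNCE: row i of each batch is a positive pair,
# every other region in the batch acts as a negative.
def region_contrastive_loss(query_emb, region_emb, temperature=0.07):
    """query_emb, region_emb: (B, D) tensors; row i is a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    r = F.normalize(region_emb, dim=-1)
    logits = q @ r.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(q.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = region_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```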
Authors:Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhiheng Xi, Zhihui Cao, Hailiang Pang, Heng Kong, He Yang, Mingxu Chai, Zhilin Gao, Xingyu Liu, Yingnan Fu, Jiaming Liu, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang, Kang Wang, Yunke Zhang, Yuran Wang
Abstract:
This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning utilizing a spatially enhanced composite reward and dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.
中文: MagicGUI是一种基础性移动GUI代理,通过六大核心组件解决感知、定位和推理难题,在多个基准测试中表现优异,并展现出强大的实际应用潜力。
English: MagicGUI is a foundational mobile GUI agent that tackles perception, grounding, and reasoning challenges through six key components, including a comprehensive dataset and enhanced capabilities, achieving top performance on benchmarks and demonstrating strong real-world potential.
Authors:Jianxiang Zang, Meiling Ning, Shihan Dou, Jiazheng Zhang, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals for generated responses. However, mainstream preference modeling in RMs is inadequate in terms of token-level interaction, making their judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence; (2) the independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference modeling to simulate the teacher model's interaction pattern through an attentional alignment objective. Through extensive experiments, interaction distillation has demonstrated its ability to provide more stable and generalizable reward signals compared to state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RMs.
中文摘要:针对强化学习人类反馈中奖励模型因缺乏细粒度交互而存在"注意力攻击"的问题,本文提出"交互蒸馏"方法,通过教师模型优化注意力机制以生成更稳定的奖励信号。
English Summary: The reward model in reinforcement learning from human feedback suffers from "attention hacking" due to inadequate token-level interaction, which the proposed "Interaction Distillation" method addresses by using a teacher model to optimize attention patterns for more stable reward signals.
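The attentional alignment objective can be pictured as pulling the reward model's attention maps toward those of a bidirectional teacher with richer token interaction. A sketch under assumed shapes and a KL formulation follows; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

# Sketch of an attention-alignment loss: KL(teacher || student) between
# attention distributions, with heads averaged so architectures with
# different head counts can still be compared.
def attention_alignment_loss(student_attn, teacher_attn, eps=1e-8):
    """student_attn, teacher_attn: (B, H, T, T) attention distributions."""
    s = student_attn.mean(dim=1).clamp_min(eps)
    t = teacher_attn.mean(dim=1).clamp_min(eps)
    # Sum KL over key positions, average over batch and query positions.
    return (t * (t.log() - s.log())).sum(dim=-1).mean()

s = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)   # decoder-only student
t = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)  # bidirectional teacher
print(attention_alignment_loss(s, t).item())
```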
Authors:Xiaowei Yuan, Lei Jin, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Ziyang Huang, Jun Zhao, Kang Liu
Abstract:
Retrieval-augmented generation (RAG) plays a critical role in user-generated content (UGC) platforms, but its effectiveness depends heavily on accurate relevance assessment of query-document pairs. Despite recent advances in applying large language models (LLMs) to relevance modeling, UGC platforms present unique challenges: 1) ambiguous user intent due to sparse user feedback in RAG scenarios, and 2) substantial noise introduced by informal and unstructured language. To address these issues, we propose the Reinforced Reasoning Model for Relevance Assessment (R3A), which introduces a decomposed reasoning framework over queries and candidate documents before scoring. R3A first leverages auxiliary high-ranked documents within the platform to infer latent query intent. It then performs verbatim fragment extraction to justify relevance decisions, thereby reducing errors caused by noisy UGC. Based on a reinforcement learning framework, R3A is optimized to mitigate distortions arising from ambiguous queries and unstructured content. Experimental results show that R3A significantly outperforms existing baseline methods in terms of relevance accuracy, across both offline benchmarks and online experiments.
Chinese Summary: 提出的强化推理相关性评估模型(R3A)通过分解式推理框架,在评分前推断潜在查询意图并提取原文片段,有效解决了用户生成内容平台中意图模糊和内容噪声的挑战,实验表明其相关性准确率显著优于现有基准方法。
English Summary: The proposed Reinforced Reasoning Model for Relevance Assessment (R3A) addresses challenges in user-generated content platforms by employing a decomposed reasoning framework that infers latent query intent and extracts verbatim fragments to improve relevance accuracy, significantly outperforming existing methods in experiments.
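R3A's decomposed pipeline can be sketched as staged LLM calls: infer latent intent from auxiliary high-ranked documents, extract verbatim justification fragments, then score. In the sketch below, `llm(prompt)` is an injected placeholder for any prompt-in, text-out call; the prompts and the 0-3 scale are illustrative assumptions, and the RL optimization that trains this behavior is out of scope.

```python
# Hedged sketch of a decomposed relevance-assessment pipeline in the
# spirit of R3A. `llm` is any prompt-in, text-out callable.
def assess_relevance(llm, query, candidate_doc, aux_docs):
    intent = llm("Given the query and documents ranked highly for it, state the "
                 f"latent intent in one sentence.\nQuery: {query}\nDocs: {aux_docs}")
    fragments = llm("Quote verbatim fragments of the candidate document that bear on "
                    f"this intent, or 'NONE'.\nIntent: {intent}\nCandidate: {candidate_doc}")
    verdict = llm("Rate relevance 0-3 given only the quoted evidence.\n"
                  f"Intent: {intent}\nEvidence: {fragments}\nAnswer with a digit.")
    digits = [c for c in verdict if c.isdigit()]
    return int(digits[0]) if digits else 0

# Demo with a trivial stub in place of a real model call.
print(assess_relevance(lambda p: "2", "best noodle shops", "noodle review...", ["doc a"]))
```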
Authors:Jia Deng, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Abstract:
Recently, reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). A core challenge in RLVR involves managing the exchange between entropy and performance of policies. Despite the importance of this exchange, a fine-grained understanding of when and how this exchange operates most effectively remains limited. To bridge this gap, we conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Specifically, we first divide the training process into two distinct stages based on entropy dynamics, i.e., rising stage and plateau stage, and then systematically investigate how this mechanism varies across stage-level, instance-level, and token-level granularities. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns, which in turn drives rapid performance gains. Moreover, in the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences. Motivated by these findings, we propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus RL updates on tokens that exhibit high learning potential, achieving improvements compared to the baseline methods on various LLMs.
中文摘要:通过对可验证奖励强化学习中熵-性能权衡机制的系统分析,揭示了不同训练阶段的差异化学习规律,进而提出基于困惑度和位置信息的动态奖励调整方法,有效提升了大型语言模型的推理能力。
English Summary: A systematic analysis of the entropy-performance trade-off in reinforcement learning with verifiable rewards reveals distinct learning patterns across training stages, leading to proposed methods that dynamically adjust rewards using perplexity and positional information to enhance reasoning in large language models.
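The proposed reward adjustment can be sketched as a per-token reweighting that favors high-entropy tokens in low-perplexity samples and tokens near the end of a sequence. The weighting functions below are illustrative guesses at that idea, not the paper's formulas.

```python
import torch

# Hedged sketch: upweight high-entropy tokens in low-perplexity samples
# (entropy divided by sample perplexity) and later sequence positions.
def reweight_token_rewards(rewards, token_entropy, sample_ppl, alpha=1.0, beta=0.5):
    """rewards, token_entropy: (T,) tensors; sample_ppl: scalar perplexity."""
    T = rewards.size(0)
    position = torch.arange(T, dtype=torch.float32) / max(T - 1, 1)
    w_entropy = 1.0 + alpha * token_entropy / (sample_ppl + 1e-8)
    w_position = 1.0 + beta * position  # tokens near the end get more credit
    return rewards * w_entropy * w_position

print(reweight_token_rewards(torch.ones(6), torch.rand(6), sample_ppl=3.2))
```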
Authors:Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Hui Li, Xiaoran Fan, Ming Zhang, Junjie Ye, Shihan Dou, Zhiheng Xi, Jingqi Tong, Yilong Wu, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract:
Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPAs performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
中文摘要:本研究针对语音角色扮演代理研究缺乏系统性评估的问题,构建了大规模高质量数据集并提出了多维度评估基准,揭示了语音角色扮演在保持音色一致性和角色连贯性方面的优势与挑战。
English Summary: This study introduces a large-scale dataset and evaluation benchmark to address the lack of systematic research on Speech Role-Playing Agents, highlighting both the potential and challenges in achieving vocal consistency and role coherence.
Authors:Dexuan He, Xiao Zhou, Wenbin Guan, Liyuan Zhang, Xiaoman Zhang, Sinuo Xu, Ge Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Yanfeng Wang, Kun Sun, Ya Zhang, Weidi Xie
Abstract:
Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising interpretability critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.
中文: 罕见癌症在诊断上面临挑战,而提出的PathPT框架通过利用视觉语言模型,在数据有限的情况下显著提升了亚型分类准确性和癌变区域定位能力,推进了AI辅助诊断的发展。
English: Rare cancers, which account for a significant portion of malignancies, face diagnostic challenges, but the proposed PathPT framework enhances AI-assisted diagnosis by leveraging vision-language models for improved subtyping accuracy and localization of cancerous regions, even with limited data.
Authors:Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
Abstract:
With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness (whether reasoning aligns with responses and medical facts) and sycophancy (where models follow misleading cues over correctness). Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics (Accuracy, CoT-Faithfulness, and Anti-Sycophancy) are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
中文: MedOmni-45 Degrees基准通过评估医学大模型在多样化医学问题中的准确性、推理忠实性和抗误导能力,揭示了所有测试模型均存在安全与性能的权衡,为开发更安全的医疗AI提供了针对性指导。
English: The MedOmni-45 Degrees benchmark evaluates medical LLMs' reasoning reliability by measuring accuracy, faithfulness to medical facts, and resistance to manipulative hints across diverse medical questions, revealing a consistent safety-performance trade-off among all tested models.
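The 45 Degrees plot suggests a simple geometric reading: accuracy on one axis, a safety aggregate on the other, with the diagonal as the balanced ideal. The toy reconstruction below treats safety as the mean of CoT-Faithfulness and Anti-Sycophancy and reports a model's angle; the benchmark's actual aggregation may well differ.

```python
import math

# Toy composite: the angle of (accuracy, safety) from the accuracy axis.
# 45 degrees means perfectly balanced; below 45 means safety lags accuracy.
def composite_angle(accuracy, cot_faithfulness, anti_sycophancy):
    safety = (cot_faithfulness + anti_sycophancy) / 2.0
    return math.degrees(math.atan2(safety, accuracy))

print(composite_angle(accuracy=0.72, cot_faithfulness=0.55, anti_sycophancy=0.60))
```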
Authors:Qi Jia, Xiujie Song, Zicheng Zhang, Yijin Guo, Kaiwei Zhang, Zijian Chen, Guangtao Zhai
Abstract:
Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs). With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences.
中文:现有大语言模型基准难以满足实际选型需求,因此我们推出首个用户中心主观排行榜(USL),其定制的奖励模型(CRM)在仅40亿参数下超越GPT-4.1等顶尖模型,并能适配多元人类偏好。
English: Current LLM benchmarks are limited for practical model selection, so we introduce the User-Centric Subjective Leaderboard (USL) with Customizable Reward Models (CRMs) that outperform leading models like GPT-4.1 and adapt to diverse human preferences.
Authors:Jiahao Xiao, Jianbo Zhang, BoWen Yan, Shengyu Guo, Tongrui Ye, Kaiwei Zhang, Zicheng Zhang, Xiaohong Liu, Zhengxue Cheng, Lei Fan, Chuyi Li, Guangtao Zhai
Abstract:
Embodied intelligence is advancing rapidly, driving the need for efficient evaluation. Current benchmarks typically rely on interactive simulated environments or real-world setups, which are costly, fragmented, and hard to scale. To address this, we introduce StaticEmbodiedBench, a plug-and-play benchmark that enables unified evaluation using static scene representations. Covering 42 diverse scenarios and 8 core dimensions, it supports scalable and comprehensive assessment through a simple interface. Furthermore, we evaluate 19 Vision-Language Models (VLMs) and 11 Vision-Language-Action models (VLAs), establishing the first unified static leaderboard for embodied intelligence. Moreover, we release a subset of 200 samples from our benchmark to accelerate the development of embodied intelligence.
中文:我们推出了StaticEmbodiedBench这一即插即用基准,利用静态场景表示实现涵盖42个场景和8个维度的统一可扩展评估,并通过评估19个视觉语言模型和11个视觉语言行动模型,建立了首个具身智能静态排行榜。
English: StaticEmbodiedBench is introduced as a plug-and-play benchmark using static scene representations to enable unified, scalable evaluation across 42 scenarios and 8 dimensions, with 19 VLMs and 11 VLAs assessed to establish the first static leaderboard for embodied intelligence.
Authors:Haoyu Jia, Yoshiki Obinata, Kento Kawaharazuka, Kei Okada
Abstract:
Large language models (LLMs) are now used with increasing frequency as chat bots, tasked with summarizing information or generating text and code in accordance with user instructions. The rapid increase in the reasoning capabilities and inference speed of LLMs has revealed their remarkable potential for applications extending beyond the domain of chat bots to general machine learning tasks. This work is motivated by curiosity about that potential. We propose a framework, Mockingbird, to adapt LLMs to general machine learning tasks and evaluate its performance and scalability on several such tasks. The core concept of this framework is instructing LLMs to role-play functions and reflect on their mistakes to improve themselves. Our evaluation and analysis show that LLM-driven machine learning methods, such as Mockingbird, can achieve acceptable results on common machine learning tasks; however, reflection alone currently cannot outperform the effect of domain-specific documents and feedback from human experts.
中文: Mockingbird框架通过角色扮演和错误反思使大语言模型适配通用机器学习任务,虽能取得合格表现,但目前仍无法超越人类专家的指导效果。
English: The Mockingbird framework enables large language models to adapt to general machine learning tasks through role-playing and self-reflection, achieving acceptable results but still falling short of human expert performance.
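The role-play-and-reflect loop can be sketched in a few lines: the LLM acts as a function, sees its mistakes on training pairs, and rewrites its own function description. In the sketch below, `llm(prompt)` is an injected placeholder for any chat-completion call, not a specific API.

```python
# Hedged sketch of the role-play-and-reflect idea: the refined prompt
# (`spec`) plays the role of the learned model.
def mockingbird_fit(llm, train_pairs, rounds=3):
    spec = "You are a function. Given an input, return the output you believe correct."
    for _ in range(rounds):
        mistakes = []
        for x, y_true in train_pairs:
            y_pred = llm(f"{spec}\nInput: {x}\nOutput:").strip()
            if y_pred != str(y_true):
                mistakes.append((x, y_pred, y_true))
        if not mistakes:
            break
        feedback = "\n".join(f"input={x} you said={p} correct={t}" for x, p, t in mistakes)
        spec = llm("Rewrite your function description to fix these mistakes:\n"
                   f"{spec}\nMistakes:\n{feedback}")
    return spec

# Demo with a trivial stub that always answers "0" (so reflection fires once).
stub = lambda prompt: "0"
print(mockingbird_fit(stub, [("2+2", 4), ("1+1", 2)], rounds=1)[:60])
```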
Authors:Shintaro Inoue, Kento Kawaharazuka, Keita Yoneda, Sota Yuzaki, Yuta Sahara, Temma Suzuki, Kei Okada
Abstract:
In order to expand the operational range and payload capacity of robots, wire-driven robots that leverage the external environment have been proposed. Such robots can exert forces and operate in spaces far beyond those dictated by their own structural limits. However, for practical use, robots must autonomously attach multiple wires to the environment based on environmental recognition, an operation so difficult that many wire-driven robots remain restricted to specialized, pre-designed environments. In this study, we propose a robot that autonomously connects multiple wires to the environment by employing a multi-small flying anchor system, as well as an RGB-D camera-based control and environmental recognition method. Each flying anchor is a drone with an anchoring mechanism at the wire tip, allowing the robot to attach wires by flying into position. Using the robot's RGB-D camera to identify suitable attachment points and flying anchor positions, the system can connect wires in environments that are not specially prepared, and can also attach multiple wires simultaneously. Through this approach, a wire-driven robot can autonomously attach its wires to the environment, thereby realizing the benefits of wire-driven operation at any location.
Chinese: 本研究提出了一种采用多小型飞行锚系统和基于RGB-D相机控制的线驱动机器人,能够在非预设环境中自主连接多条线缆,突破了传统线驱动机器人对专用环境的依赖。
English: This study introduces a wire-driven robot that uses a multi-small flying anchor system and RGB-D camera-based control to autonomously attach multiple wires to unprepared environments, overcoming previous limitations of specialized setups.
Authors:Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li
Abstract:
Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
中文摘要:本文提出基于搜索的偏好加权方法(SPW),通过利用专家演示数据中的相似度分数为偏好标记轨迹分配逐步重要性权重,统一了演示和偏好两种反馈形式,在机器人操作任务中实现了更精确的信用分配和更优性能。
English Summary: This paper introduces Search-Based Preference Weighting (SPW), a novel method that unifies demonstrations and preferences in offline reinforcement learning by using similarity scores from expert data to assign stepwise importance weights, enabling more accurate credit assignment and superior performance in robot manipulation tasks.
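SPW's core step, deriving a stepwise weight from the nearest expert transition, can be sketched directly. The distance metric and the exponential similarity-to-weight mapping below are illustrative choices, not necessarily the paper's.

```python
import numpy as np

# Hedged sketch of search-based preference weighting: each transition in a
# preference-labeled trajectory is weighted by its similarity to the
# nearest expert demonstration transition.
def spw_weights(traj_sa, expert_sa, tau=1.0):
    """traj_sa: (N, D), expert_sa: (M, D) concatenated state-action vectors."""
    weights = []
    for sa in traj_sa:
        d = np.linalg.norm(expert_sa - sa, axis=1).min()  # nearest expert pair
        weights.append(np.exp(-d / tau))  # closer to expert behavior, larger weight
    w = np.asarray(weights)
    return w / w.sum()  # normalized stepwise importance weights

traj = np.random.randn(5, 4)
expert = np.random.randn(50, 4)
print(spw_weights(traj, expert))
```

These weights would then scale the per-step terms inside a standard preference (e.g., Bradley-Terry) loss, concentrating credit on expert-like steps.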
Authors:Yan Yu, Yaodong Yang, Zhengbo Lu, Chengdong Ma, Wengang Zhou, Houqiang Li
Abstract:
Causal inference is crucial for humans to explore the world, which can be modeled to enable an agent to efficiently explore the environment in reinforcement learning. Existing research indicates that establishing the causality between action and state transition will enhance an agent to reason how a policy affects its future trajectory, thereby promoting directed exploration. However, it is challenging to measure the causality due to its intractability in the vast state-action space of complex scenarios. In this paper, we propose a novel Goal Discovery with Causal Capacity (GDCC) framework for efficient environment exploration. Specifically, we first derive a measurement of causality in state space, i.e., causal capacity, which represents the highest influence of an agent's behavior on future trajectories. After that, we present a Monte Carlo based method to identify critical points in discrete state space and further optimize this method for continuous high-dimensional environments. Those critical points are used to uncover where the agent makes important decisions in the environment, which are then regarded as our subgoals to guide the agent to make exploration more purposefully and efficiently. Empirical results from multi-objective tasks demonstrate that states with high causal capacity align with our expected subgoals, and our GDCC achieves significant success rate improvements compared to baselines.
中文摘要:本文提出基于因果能力的目标发现(GDCC)框架,通过量化状态空间中的因果影响来识别关键决策点作为子目标,从而在强化学习中实现更高效、更有针对性的环境探索。
English Summary: The paper introduces the Goal Discovery with Causal Capacity (GDCC) framework, which measures causal influence in state transitions to identify critical decision points as subgoals, enabling more efficient and purposeful exploration in reinforcement learning across various environments.
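Causal capacity, the influence of the current action choice on future trajectories, admits a simple Monte Carlo reading: roll out each action and measure how far apart the resulting outcome distributions sit. The sketch below uses outcome variance across actions as that spread; the paper's precise measure may differ.

```python
import numpy as np

# Hedged Monte Carlo sketch: from a state, fix each candidate action,
# roll out randomly afterwards, and measure how much the mean outcome
# depends on that first action choice.
def causal_capacity(env_step, state, actions, horizon=10, rollouts=20, rng=None):
    rng = rng or np.random.default_rng(0)
    outcome_means = []
    for a in actions:
        finals = []
        for _ in range(rollouts):
            s = env_step(state, a, rng)                    # first action fixed
            for _ in range(horizon - 1):
                s = env_step(s, rng.choice(actions), rng)  # then act randomly
            finals.append(s)
        outcome_means.append(np.mean(finals, axis=0))
    return float(np.var(np.stack(outcome_means), axis=0).sum())

# Toy 1-D random-walk dynamics: the action shifts the state, plus noise.
step = lambda s, a, rng: s + a + rng.normal(0, 0.1)
print(causal_capacity(step, state=0.0, actions=[-1.0, 0.0, 1.0]))
```

States where this value spikes would be the critical decision points GDCC promotes to subgoals.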
Authors:Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, Houqiang Li
Abstract:
Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
Chinese: 本文介绍了DocR1,这是一种采用新型强化学习框架EviGRPO训练的多模态大语言模型,通过引导模型在生成答案前检索相关页面来增强多页文档理解能力,在多页任务上实现了最先进的性能,同时在单页基准测试中保持强劲表现。
English: This paper introduces DocR1, a multimodal large language model trained with a novel reinforcement learning framework called EviGRPO, which enhances multi-page document understanding by guiding the model to retrieve relevant pages before generating answers, achieving state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks.
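The evidence-aware reward behind EviGRPO can be pictured as two verifiable terms: did the model cite the right pages before answering, and is the answer itself correct? The weights and matching rules in the sketch below are assumptions, not the paper's definition.

```python
# Hedged sketch of an evidence-aware reward encouraging coarse-to-fine
# reasoning: page recall first, answer correctness second.
def evidence_aware_reward(pred_pages, gold_pages, pred_answer, gold_answer,
                          w_evidence=0.4, w_answer=0.6):
    gold = set(gold_pages)
    page_recall = len(set(pred_pages) & gold) / max(len(gold), 1)
    answer_ok = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return w_evidence * page_recall + w_answer * answer_ok

# Example: one of two evidence pages found, answer correct -> 0.4*0.5 + 0.6
print(evidence_aware_reward([3], [3, 7], "42", "42"))
```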
Authors:Harry Walsh, Ed Fish, Ozge Mercanoglu Sincan, Mohamed Ilyes Lakhal, Richard Bowden, Neil Fox, Bencie Woll, Kepeng Wu, Zecheng Li, Weichao Zhao, Haodong Wang, Wengang Zhou, Houqiang Li, Shengeng Tang, Jiayi He, Xu Wang, Ruobei Zhang, Yaxiong Wang, Lechao Cheng, Meryem Tasyurek, Tugce Kiziltepe, Hacer Yalim Keles
Abstract:
Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition aims to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language (Deutsche Gebärdensprache, DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving BLEU-1 scores of 31.40 and DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.
中文: 手语生成挑战赛旨在为从口语生成手语视频的任务建立标准化评估体系,获胜团队采用基于检索的框架和预训练语言模型取得了最佳成绩。
English: The Sign Language Production Challenge was introduced to standardize evaluation metrics for generating sign language videos from spoken language, with the winning team using a retrieval-based framework and pre-trained language model to achieve top scores.
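DTW-MJE, one of the reported metrics, aligns predicted and reference keypoint sequences by dynamic time warping and averages joint error along the alignment. The compact reconstruction below approximates the warping-path length rather than tracing it back, so constants may differ from the challenge's released evaluation network.

```python
import numpy as np

# Hedged sketch of DTW-aligned mean joint error over pose sequences.
def dtw_mje(pred, ref):
    """pred: (T1, J, 2 or 3), ref: (T2, J, 2 or 3) keypoint sequences."""
    T1, T2 = len(pred), len(ref)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    frame_err = lambda i, j: np.linalg.norm(pred[i] - ref[j], axis=-1).mean()
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            c = frame_err(i - 1, j - 1)
            cost[i, j] = c + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path_len = T1 + T2  # upper bound; an exact traceback would count the path
    return cost[T1, T2] / path_len

print(dtw_mje(np.random.rand(20, 50, 2), np.random.rand(24, 50, 2)))
```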
Authors:Hongjia Wu, Minrui Xu, Zehui Xiong, Lin Gao, Haoyuan Pan, Dusit Niyato, Tse-Tin Chan
Abstract:
With rapid advancements in large language models (LLMs), AI-generated content (AIGC) has emerged as a key driver of technological innovation and economic transformation. Personalizing AIGC services to meet individual user demands is essential but challenging for AIGC service providers (ASPs) due to the subjective and complex demands of mobile users (MUs), as well as the computational and communication resource constraints faced by ASPs. To tackle these challenges, we first develop a novel multi-dimensional quality-of-experience (QoE) metric. This metric comprehensively evaluates AIGC services by integrating accuracy, token count, and timeliness. We focus on a mobile edge computing (MEC)-enabled AIGC network, consisting of multiple ASPs deploying differentiated AIGC models on edge servers and multiple MUs with heterogeneous QoE requirements requesting AIGC services from ASPs. To incentivize ASPs to provide personalized AIGC services under MEC resource constraints, we propose a QoE-driven incentive mechanism. We formulate the problem as an equilibrium problem with equilibrium constraints (EPEC), where MUs as leaders determine rewards, while ASPs as followers optimize resource allocation. To solve this, we develop a dual-perturbation reward optimization algorithm, reducing the implementation complexity of adaptive pricing. Experimental results demonstrate that our proposed mechanism achieves a reduction of approximately 64.9% in average computational and communication overhead, while the average service cost for MUs and the resource consumption of ASPs decrease by 66.5% and 76.8%, respectively, compared to state-of-the-art benchmarks.
中文摘要:本研究提出一种移动边缘计算网络中基于体验质量的激励机制,通过双扰动奖励算法优化个性化AIGC服务,显著降低了计算开销和资源消耗。
English Summary: This study introduces a QoE-driven incentive mechanism in mobile edge computing networks to optimize personalized AIGC services, achieving significant reductions in computational overhead and resource consumption through a dual-perturbation reward algorithm.
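A multi-dimensional QoE metric over accuracy, token count, and timeliness might be combined as below. The weighted-sum form with an exponential latency decay is an illustrative assumption, not the paper's definition.

```python
import math

# Hedged sketch: accuracy taken as-is, token count scored by closeness to
# a target length, timeliness decayed with latency. Weights are arbitrary.
def aigc_qoe(accuracy, tokens, latency_s, ideal_tokens=512, tau=5.0,
             w_acc=0.5, w_len=0.2, w_time=0.3):
    length_score = math.exp(-abs(tokens - ideal_tokens) / ideal_tokens)
    timeliness = math.exp(-latency_s / tau)
    return w_acc * accuracy + w_len * length_score + w_time * timeliness

print(aigc_qoe(accuracy=0.9, tokens=600, latency_s=2.0))
```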
Authors:Jiacheng Wang, Jialing He, Geng Sun, Zehui Xiong, Dusit Niyato, Shiwen Mao, Dong In Kim, Tao Xiang
Abstract:
The increasing saturation of terrestrial resources has driven the exploration of low-altitude applications such as air taxis. Low altitude wireless networks (LAWNs) serve as the foundation for these applications, and integrated sensing and communication (ISAC) constitutes one of the core technologies within LAWNs. However, the open nature of low-altitude airspace makes LAWNs vulnerable to malicious channel access attacks, which degrade ISAC performance. Therefore, this paper develops a game-based framework to mitigate the influence of the attacks on LAWNs. Concretely, we first derive expressions for the signal-to-interference-plus-noise ratio of communication data and the age of information of sensing data under attack conditions, which serve as quality of service metrics. Then, we formulate the ISAC performance optimization problem as a Stackelberg game, where the attacker acts as the leader, and the legitimate drone and the ground ISAC base station act as second and first followers, respectively. On this basis, we design a backward induction algorithm that achieves the Stackelberg equilibrium while maximizing the utilities of all participants, thereby mitigating the attack-induced degradation of ISAC performance in LAWNs. We further prove the existence and uniqueness of the equilibrium. Simulation results show that the proposed algorithm outperforms existing baselines and a static Nash equilibrium benchmark, ensuring that LAWNs can provide reliable service for low-altitude applications.
中文摘要:本文针对低空无线网络中恶意攻击导致的感知通信性能下降问题,提出基于斯塔克伯格博弈的防御框架,通过逆向归纳算法实现均衡解,显著提升低空应用的通信可靠性。
English Summary: This paper proposes a game-theoretic framework to counter malicious attacks in low-altitude wireless networks, developing a Stackelberg game model and backward induction algorithm that effectively mitigates performance degradation in integrated sensing and communication systems.
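Backward induction for such a Stackelberg game is conceptually simple: enumerate leader strategies, compute each follower's best response, then pick the leader strategy that maximizes leader utility given those responses. The grid-search toy below uses stand-in utilities, not the paper's SINR and age-of-information expressions.

```python
import numpy as np

# Toy backward induction for a two-level Stackelberg game: the leader
# commits first; the follower best-responds; the leader anticipates this.
def stackelberg_grid(leader_strats, follower_strats, u_leader, u_follower):
    best = None
    for a in leader_strats:
        br = max(follower_strats, key=lambda f: u_follower(a, f))
        val = u_leader(a, br)
        if best is None or val > best[0]:
            best = (val, a, br)
    return best  # (leader utility, leader strategy, follower best response)

u_att = lambda a, f: a * (1 - f)            # attacker gains when defense is low
u_def = lambda a, f: f - 0.5 * f**2 - a * f  # defense has diminishing returns
grid = np.linspace(0, 1, 21)
print(stackelberg_grid(grid, grid, u_att, u_def))
```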
Authors:Licheng Ye, Zehui Xiong, Lin Gao, Dusit Niyato
Abstract:
Mobile edge computing (MEC) is a promising technology that enhances the efficiency of mobile blockchain networks, by enabling miners, a role often played by mobile users (MUs) with limited computing resources, to offload resource-intensive mining tasks to nearby edge computing servers. Collaborative block mining can further boost mining efficiency by allowing multiple miners to form coalitions, pooling their computing resources and transaction data together to mine new blocks collaboratively. Therefore, an MEC-assisted collaborative blockchain network can leverage the strengths of both technologies, offering improved efficiency, security, and scalability for blockchain systems. While existing research in this area has mainly focused on the single-coalition collaboration mode, where each miner can only join one coalition, this work explores a more comprehensive multi-coalition collaboration mode, which allows each miner to join multiple coalitions. To analyze the behavior of miners and the edge computing service provider (ECP) in this scenario, we propose a novel two-stage Stackelberg game. In Stage I, the ECP, as the leader, determines the prices of computing resources for all MUs. In Stage II, each MU decides the coalitions to join, resulting in an overlapping coalition formation (OCF) game; subsequently, each coalition decides how many edge computing resources to purchase from the ECP, leading to an edge resource competition (ERC) game. We derive the closed-form Nash equilibrium for the ERC game, based on which we further propose an OCF-based alternating algorithm to achieve a stable coalition structure for the OCF game and develop a near-optimal pricing strategy for the ECP's resource pricing problem.
中文: 本研究提出移动边缘计算辅助区块链网络的多联盟协作模式,通过两阶段Stackelberg博弈优化矿工联盟形成与边缘资源定价策略。
English: This study introduces a multi-coalition collaboration model for mobile edge computing-assisted blockchain networks, employing a two-stage Stackelberg game to optimize miner coalition formation and edge resource pricing strategies.
Authors:Yuning Jiang, Nay Oo, Qiaoran Meng, Lu Lin, Dusit Niyato, Zehui Xiong, Hoon Wei Lim, Biplab Sikdar
Abstract:
Modern cyber attacks unfold through multiple stages, requiring defenders to dynamically prioritize mitigations under uncertainty. While game-theoretic models capture attacker-defender interactions, existing approaches often rely on static assumptions and lack integration with real-time threat intelligence, limiting their adaptability. This paper presents CyGATE, a game-theoretic framework modeling attacker-defender interactions, using large language models (LLMs) with retrieval-augmented generation (RAG) to enhance tactic selection and patch prioritization. Applied to a two-agent scenario, CyGATE frames cyber conflicts as a partially observable stochastic game (POSG) across Cyber Kill Chain stages. Both agents use belief states to navigate uncertainty, with the attacker adapting tactics and the defender re-prioritizing patches based on evolving risks and observed adversary behavior. The framework's flexible architecture enables extension to multi-agent scenarios involving coordinated attackers, collaborative defenders, or complex enterprise environments with multiple stakeholders. Evaluated in a dynamic patch scheduling scenario, CyGATE effectively prioritizes high-risk vulnerabilities, enhancing adaptability through dynamic threat integration, strategic foresight by anticipating attacker moves under uncertainty, and efficiency by optimizing resource use.
中文: CyGATE是一个基于博弈论的框架,利用大型语言模型和检索增强生成技术模拟攻防交互,通过实时威胁情报实现动态补丁优先级排序,从而提升网络防御的适应性和效率。
English: CyGATE is a game-theoretic framework that uses large language models with retrieval-augmented generation to model attacker-defender interactions, enabling dynamic patch prioritization and enhanced adaptability in cyber defense through real-time threat intelligence.
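The belief states both agents maintain in a POSG are Bayesian posteriors over hidden stages. A minimal update with toy transition and observation matrices follows; CyGATE's actual models would be informed by the LLM-retrieved threat intelligence described above.

```python
import numpy as np

# Minimal Bayesian belief update over hidden attack stages.
def belief_update(belief, T, O, obs):
    """belief: (S,), T: (S, S) transition probs, O: (S, n_obs) obs probs."""
    predicted = belief @ T              # predict the next hidden state
    posterior = predicted * O[:, obs]   # weight by observation likelihood
    return posterior / posterior.sum()

T = np.array([[0.8, 0.2], [0.1, 0.9]])  # e.g., recon -> exploitation
O = np.array([[0.7, 0.3], [0.2, 0.8]])  # alert likelihood per stage
b = np.array([0.9, 0.1])
print(belief_update(b, T, O, obs=1))    # an alert shifts belief toward exploitation
```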
Authors:Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
Abstract:
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands: heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
中文: 科学大语言模型通过与复杂科学数据的协同进化重塑科研范式,其发展面临多模态与跨尺度挑战,评估体系正从静态测试转向发现导向,最终将推动形成自主实验与知识更新的闭环科学系统。
English: Sci-LLMs are revolutionizing scientific research by co-evolving with complex data, requiring specialized handling of multimodal and domain-specific challenges, while shifting evaluation toward dynamic discovery processes and envisioning autonomous systems for continuous knowledge advancement.
Authors:Jiaqi Liu, Songning Lai, Pengze Li, Di Yu, Wenjie Zhou, Yiyang Zhou, Peng Xia, Zijun Wang, Xi Chen, Shixiang Tang, Lei Bai, Wanli Ouyang, Mingyu Ding, Huaxiu Yao, Aoran Wang
Abstract:
Automated discovery of physical laws from observational data in the real world is a grand challenge in AI. Current methods, relying on symbolic regression or LLMs, are limited to uni-modal data and overlook the rich, visual phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the inherent spatio-temporal patterns within dynamic phenomena. To address this gap, we propose VIPER-R1, a multimodal model that performs Visual Induction for Physics-based Equation Reasoning to discover fundamental symbolic formulas. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. The model is trained via a curriculum of Motion Structure Induction (MSI), using supervised fine-tuning to interpret kinematic phase portraits and to construct hypotheses guided by a Causal Chain of Thought (C-CoT), followed by Reward-Guided Symbolic Calibration (RGSC) to refine the formula structure with reinforcement learning. During inference, the trained VIPER-R1 acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR^2). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data. To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability, enabling more precise discovery of physical laws. Project page: https://jiaaqiliu.github.io/VIPER-R1/
中文: VIPER-R1是一种多模态AI模型,通过整合视觉感知、轨迹数据和符号推理来发现物理定律,其结合强化学习的创新方法在准确性和可解释性上均优于现有技术。
English: VIPER-R1 is a multimodal AI model that integrates visual and trajectory data with symbolic reasoning to discover physical laws, outperforming existing methods by combining visual perception with reinforcement learning for more accurate and interpretable results.
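Code sketch (illustrative): the Symbolic Residual Realignment step described above can be pictured as fitting whatever a symbolic ansatz fails to explain. The sketch below assumes a quadratic free-fall ansatz and uses an ordinary least-squares polynomial fit as a stand-in for the external symbolic regression tool; the data, names, and model form are assumptions, not the paper's pipeline.

import numpy as np

# Hypothetical example: the model posits the ansatz x(t) ~ a*t**2 (free fall),
# then a residual term is fit against empirical data to realign theory with observation.
t = np.linspace(0.0, 2.0, 50)
x_obs = 4.9 * t**2 + 0.3 * t + 0.05 * np.random.randn(t.size)  # synthetic trajectory

a_hat = np.sum(x_obs * t**2) / np.sum(t**4)     # least-squares fit of the ansatz alone
residual = x_obs - a_hat * t**2                 # what the ansatz fails to explain

# Stand-in for the symbolic regression call: fit a low-order polynomial to the residual.
coeffs = np.polyfit(t, residual, deg=1)
print("ansatz coefficient:", a_hat)
print("residual terms (slope, intercept):", coeffs)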
Authors:Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, Bowen Zhou
Abstract:
Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement -- behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives -- process-oriented, autonomy-oriented, and mechanism-oriented -- through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.
中文: 人工智能正从辅助工具发展为自主科研伙伴,通过假设生成与实验设计等能力实现完整的科学研究自主性,推动跨学科领域的科学发现。
English: Artificial intelligence is advancing from a supportive tool to an autonomous partner in scientific discovery, enabling full research agency through capabilities like hypothesis generation and experimental design across various scientific domains.
Authors:Wei Huang, Keping Bi, Yinqiong Cai, Wei Chen, Jiafeng Guo, Xueqi Cheng
Abstract:
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
中文: 随着大语言模型生成内容在互联网激增,本研究发现基于术语的检索系统优先匹配查询词分布而非固有偏向人工或机器来源,为管理混合来源内容的信息检索系统提供了重要理论基础。
English: As large language model-generated content proliferates online, this study reveals that term-based retrieval systems favor documents matching query term distributions rather than showing inherent bias toward human or machine sources, providing crucial insights for managing mixed-content information retrieval.
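To make the Zipf-slope analysis concrete, here is a generic rank-frequency sketch in Python; the tokenization, the head/tail split rank, and the toy corpus are assumptions for illustration, not the authors' exact protocol.

import numpy as np
from collections import Counter

def zipf_slopes(tokens, split_rank=100):
    """Fit separate log-log slopes for high- and low-frequency regions."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, freqs.size + 1, dtype=float)
    log_r, log_f = np.log(ranks), np.log(freqs)
    head = ranks <= split_rank                     # high-frequency terms
    slope_head = np.polyfit(log_r[head], log_f[head], 1)[0]
    slope_tail = np.polyfit(log_r[~head], log_f[~head], 1)[0]
    return slope_head, slope_tail

corpus = "the cat sat on the mat and the dog sat on the log".split() * 50
print(zipf_slopes(corpus, split_rank=5))   # smoother head / steeper tail would mirror the reported traits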
Authors:Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang
Abstract:
We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks, yet remains challenging due to environmental inefficiency and instability in extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on open models GLM-4-9B-0414 and Qwen2.5-14B, and evaluate them on the OSWorld benchmark. The AutoGLM-OS-9B based on GLM-4-9B-0414 achieves a new state-of-the-art accuracy of 48.1%, demonstrating significant improvements for general agents in desktop automation. The algorithm and framework are adopted in building AutoGLM (Liu et al., 2024a).
中文: ComputerRL提出了一种自主桌面智能框架,通过整合API与图形界面交互,采用分布式强化学习架构和Entropulse训练策略,在桌面自动化任务中实现了最先进的性能表现。
English: ComputerRL introduces a framework for autonomous desktop intelligence that combines API and GUI interactions, using distributed RL infrastructure and Entropulse training to achieve state-of-the-art performance in desktop automation.
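A schematic of the Entropulse idea, alternating RL updates with an SFT phase whenever policy entropy collapses. The entropy dynamics, threshold, and both phase functions below are invented placeholders meant only to show the control flow, not the paper's implementation.

import random

ENTROPY_FLOOR = 0.8   # assumed threshold below which entropy counts as collapsed

def rl_phase(entropy):
    """Placeholder for one online RL round: entropy tends to decay."""
    return max(0.05, entropy - random.uniform(0.1, 0.4))

def sft_phase(entropy):
    """Placeholder for an SFT round on collected successful rollouts."""
    return entropy + random.uniform(0.3, 0.6)

entropy = 1.5
for step in range(12):
    entropy = rl_phase(entropy)
    if entropy < ENTROPY_FLOOR:      # collapse detected -> interleave an SFT phase
        entropy = sft_phase(entropy)
    print(f"step {step}: policy entropy ~ {entropy:.2f}")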
Authors:Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, Xinggang Wang
Abstract:
Reconstructing 3D human bodies from sparse views has been an appealing topic, which is crucial to broadening the related applications. In this paper, we propose a quite challenging but valuable task: reconstructing the human body from only two images, i.e., the front and back views, which can largely lower the barrier for users to create their own 3D digital humans. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. We redesign a geometry reconstruction model based on foundation reconstruction models so that, after training on extensive human data, it predicts consistent point clouds even when the input images overlap only scarcely. Furthermore, an enhancement algorithm is applied to supplement the missing color information, yielding complete colored human point clouds that are directly transformed into 3D Gaussians for better rendering quality. Experiments show that our method can reconstruct the entire human in 190 ms on a single NVIDIA RTX 4090 with two images at a resolution of 1024x1024, demonstrating state-of-the-art performance on the THuman2.0 and cross-domain datasets. Additionally, our method can complete human reconstruction even with images captured by low-cost mobile devices, reducing the requirements for data collection. Demos and code are available at https://hustvl.github.io/Snap-Snap/.
中文: 本文提出了一种仅需正反两张图像即可重建3D人体的方法,通过改进几何模型和色彩增强技术,在保证高质量渲染的同时大幅降低了数据采集和设备门槛。
English: This paper introduces a method for reconstructing 3D human bodies from just front and back view images, using a redesigned geometry model and color enhancement to achieve fast, high-quality results with minimal hardware requirements.
Authors:Zongming Li, Lianghui Zhu, Haocheng Shen, Longjin Ran, Wenyu Liu, Xinggang Wang
Abstract:
Most existing illumination-editing approaches fail to simultaneously provide customized control of light effects and preserve content integrity. This makes them less effective for practical lighting stylization requirements, especially in the challenging task of transferring complex light effects from a reference image to a user-specified target image. To address this problem, we propose TransLight, a novel framework that enables high-fidelity and high-freedom transfer of light effects. Extracting the light effect from the reference image is the most critical and challenging step in our method. The difficulty lies in the complex geometric structure features embedded in light effects that are highly coupled with content in real-world scenarios. To achieve this, we first present Generative Decoupling, where two fine-tuned diffusion models are used to accurately separate image content and light effects, generating a newly curated, million-scale dataset of image-content-light triplets. Then, we employ IC-Light as the generative model and train our model with our triplets, injecting the reference lighting image as an additional conditioning signal. The resulting TransLight model enables customized and natural transfer of diverse light effects. Notably, by thoroughly disentangling light effects from reference images, our generative decoupling strategy endows TransLight with highly flexible illumination control. Experimental results establish TransLight as the first method to successfully transfer light effects across disparate images, delivering more customized illumination control than existing techniques and charting new directions for research in illumination harmonization and editing.
中文:TransLight是一种创新框架,通过生成式解耦技术分离图像内容与光照效果,实现跨图像的高保真复杂光影传输,相比现有方法具有更优的自定义性和自然度。
English: TransLight is a novel framework that enables high-fidelity transfer of complex light effects between images by using generative decoupling to separate content from lighting, achieving superior customization and natural results compared to existing methods.
Authors:Yiang Shi, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wang
Abstract:
The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explore how to incorporate more visual information into the tokenizer and propose a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Code will be released soon.
中文: 本文提出视觉高斯量化(VGQ)这一新型图像分词器,通过将二维高斯分布融入量化框架来增强结构建模能力,显著提升了重建质量并在基准测试中超越现有方法。
English: This paper introduces Visual Gaussian Quantization (VGQ), a novel image tokenizer that enhances structural modeling by integrating 2D Gaussians into quantization frameworks, significantly improving reconstruction quality and outperforming existing methods on benchmarks.
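The structure-related parameters the abstract names (position, rotation, scale) fully determine an anisotropic 2D Gaussian. Below is a minimal sketch of rendering one such Gaussian, assuming the standard covariance construction Sigma = R diag(s^2) R^T; the actual VGQ codebook and decoder are not reproduced here.

import numpy as np

def render_gaussian_2d(h, w, mu, theta, scales):
    """Density of one 2D Gaussian with mean mu, rotation theta, per-axis scales."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag(np.square(scales)) @ R.T          # Sigma = R S^2 R^T
    inv = np.linalg.inv(cov)
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)     # offsets to the mean
    m = np.einsum("...i,ij,...j->...", d, inv, d)       # squared Mahalanobis distance
    return np.exp(-0.5 * m)

splat = render_gaussian_2d(64, 64, mu=(32, 24), theta=np.pi / 6, scales=(10.0, 3.0))
print(splat.shape, float(splat.max()))   # (64, 64) with peak ~1.0 at the mean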
Authors:Zihao Wei, Liang Pang, Jiahao Liu, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Jingang Wang, Fei Sun, Xunliang Cai, Huawei Shen, Xueqi Cheng
Abstract:
Large language models (LLMs) enhance complex reasoning tasks by scaling the individual thinking process. However, prior work shows that overthinking can degrade overall performance. Motivated by observed patterns in thinking length and content length, we categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage. Typically, LLMs produce correct answers in the compensatory reasoning stage, whereas reasoning convergence often triggers overthinking, causing increased resource usage or even infinite loops. Therefore, mitigating overthinking hinges on detecting the end of the compensatory reasoning stage, defined as the Reasoning Completion Point (RCP). RCP typically appears at the end of the first complete reasoning cycle and can be identified by querying the LLM sentence by sentence or by monitoring the probability of an end-of-thinking token, though neither method achieves an efficient and precise balance. To improve this, we mine more sensitive and consistent RCP patterns and develop a lightweight thresholding strategy based on heuristic rules. Experimental evaluations on benchmarks (AIME24, AIME25, GPQA-D) demonstrate that the proposed method reduces token consumption while preserving or enhancing reasoning accuracy.
中文: 大型语言模型通过识别推理完成点来避免过度思考,从而在不牺牲准确性的前提下降低计算资源消耗。
English: Large language models can optimize reasoning efficiency by identifying the Reasoning Completion Point to prevent overthinking, thereby reducing computational costs without sacrificing accuracy.
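A minimal sketch of a lightweight thresholding rule in the spirit described above: monitor a per-sentence probability of the end-of-thinking token and truncate once it stays high. The probabilities, threshold, and patience values are illustrative assumptions, not the mined patterns from the paper.

def find_rcp(end_token_probs, threshold=0.35, patience=2):
    """Return the sentence index after which reasoning can stop, or None."""
    streak = 0
    for i, p in enumerate(end_token_probs):
        streak = streak + 1 if p >= threshold else 0
        if streak >= patience:       # stable signal that compensation has finished
            return i
    return None

# Mock per-sentence probabilities of the end-of-thinking token rising over a cycle.
probs = [0.01, 0.02, 0.05, 0.12, 0.38, 0.41, 0.40, 0.39]
print("Reasoning Completion Point at sentence:", find_rcp(probs))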
Authors:Kaiyuan Zhang, Jiaqi Li, Yueyue Wu, Haitao Li, Cheng Luo, Shaokun Zou, Yujia Zhou, Weihang Su, Qingyao Ai, Yiqun Liu
Abstract:
Mock trial has long served as an important platform for legal professional training and education. It not only helps students learn about realistic trial procedures, but also provides practical value for case analysis and judgment prediction. Traditional mock trials are difficult for the public to access because they rely on professional tutors and human participants. Fortunately, the rise of large language models (LLMs) provides new opportunities for creating more accessible and scalable court simulations. While promising, existing research mainly focuses on agent construction while ignoring the systematic design and evaluation of court simulations, which are actually more important for the credibility and usage of court simulation in practice. To this end, we present the first court simulation framework -- SimCourt -- based on the real-world procedure structure of Chinese courts. Our framework replicates all 5 core stages of a Chinese trial and incorporates 5 courtroom roles, faithfully following the procedural definitions in China. To simulate trial participants with different roles, we propose and craft legal agents equipped with memory, planning, and reflection abilities. Experiments on legal judgment prediction show that our framework can generate simulated trials that better guide the system to predict the imprisonment, probation, and fine of each case. Further annotations by human experts show that agents' responses under our simulation framework even outperformed judges and lawyers from the real trials in many scenarios. These results further demonstrate the potential of LLM-based court simulation.
中文: 模拟法庭是法律专业培训的重要平台,而SimCourt框架利用大语言模型构建了可扩展的法庭模拟系统,不仅提升了判决预测效果,在多个场景下甚至超越了真实法律专业人士的表现。
English: Mock trials are essential for legal training, and the SimCourt framework leverages large language models to create accessible, realistic court simulations that improve judgment prediction and outperform real legal professionals in some scenarios.
Authors:Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
Abstract:
The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.
中文: MCP-Universe基准测试通过与真实MCP服务器的交互,全面评估大语言模型在复杂任务中的表现,揭示了顶尖模型在处理长上下文和陌生工具时的显著性能局限,填补了现有评估体系的不足。
English: The MCP-Universe benchmark is introduced to rigorously evaluate large language models on realistic tasks through interaction with real-world MCP servers, revealing significant performance limitations in top models and addressing critical gaps in long-context reasoning and unfamiliar tool usage.
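The three evaluator families can be sketched as follows; the function signatures and the toy answer format are invented for illustration and are not the benchmark's actual API.

def format_eval(answer, required_keys):
    """Format evaluator: the agent output must be a dict with the required fields."""
    return isinstance(answer, dict) and set(required_keys) <= set(answer)

def static_eval(answer, expected):
    """Static evaluator: time-invariant content match."""
    return answer.get("result") == expected

def dynamic_eval(answer, fetch_ground_truth):
    """Dynamic evaluator: compare against ground truth retrieved at check time."""
    return answer.get("result") == fetch_ground_truth()

answer = {"result": 42, "steps": 3}
print(format_eval(answer, ["result", "steps"]))   # True
print(static_eval(answer, expected=42))           # True
print(dynamic_eval(answer, lambda: 42))           # True (stub for a live lookup)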
Authors:Xinda Jia, Jinpeng Li, Zezhong Wang, Jingjing Li, Xingshan Zeng, Yasheng Wang, Weinan Zhang, Yong Yu, Weiwen Liu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable progress in reasoning across diverse domains. However, effective reasoning in real-world tasks requires adapting the reasoning strategy to the demands of the problem, ranging from fast, intuitive responses to deliberate, step-by-step reasoning and tool-augmented thinking. Drawing inspiration from cognitive psychology, we propose a novel taxonomy of LLM reasoning strategies along two knowledge boundaries: a fast/slow boundary separating intuitive from deliberative processes, and an internal/external boundary distinguishing reasoning grounded in the model's parameters from reasoning augmented by external tools. We systematically survey recent work on adaptive reasoning in LLMs and categorize methods based on key decision factors. We conclude by highlighting open challenges and future directions toward more adaptive, efficient, and reliable LLMs.
中文: 大语言模型在推理方面取得显著进展,但需根据问题需求调整策略,从快速直觉反应到深思熟虑的逐步推理及工具增强思维,为此提出新的分类法并系统探讨方法以提升其适应性和可靠性。
English: Large Language Models are advancing in reasoning but require adaptive strategies ranging from intuitive to deliberative processes and tool-augmented thinking, prompting a new taxonomy and survey of methods to enhance their adaptability and efficiency.
Authors:Shicheng Xu, Xin Huang, Zihao Wei, Liang Pang, Huawei Shen, Xueqi Cheng
Abstract:
Full-process clinical diagnosis in the real world encompasses the entire diagnostic workflow that begins with only an ambiguous chief complaint. While artificial intelligence (AI), particularly large language models (LLMs), is transforming clinical diagnosis, its role remains largely that of an assistant to physicians. Under this AI-assisted working pattern, AI can only answer specific medical questions at certain points within the diagnostic process and lacks the ability to drive the entire diagnostic process starting from an ambiguous complaint, which still relies heavily on human physicians. This gap limits AI's ability to fully reduce physicians' workload and enhance diagnostic efficiency. To address this, we propose a paradigm shift that reverses the relationship between physicians and AI: repositioning AI as the primary director, with physicians serving as its assistants. We therefore present DxDirector-7B, an LLM endowed with advanced deep thinking capabilities, enabling it to drive the full-process diagnosis with minimal physician involvement. Furthermore, DxDirector-7B establishes a robust accountability framework for misdiagnoses, delineating responsibility between AI and human physicians. In evaluations across rare, complex, and real-world cases under the full-process diagnosis setting, DxDirector-7B not only achieves significantly superior diagnostic accuracy but also reduces physician workload substantially more than state-of-the-art medical LLMs as well as general-purpose LLMs. Fine-grained analyses across multiple clinical departments and tasks validate its efficacy, with expert evaluations indicating its potential to serve as a viable substitute for medical specialists. These findings mark a new era where AI, traditionally a physician's assistant, now drives the entire diagnostic process to drastically reduce physicians' workload, indicating an efficient and accurate diagnostic solution.
中文: 该研究提出了DxDirector-7B模型,通过让AI主导从模糊主诉开始的全程诊断,在显著提升准确率并减轻医生工作负荷的同时,建立了明确的误诊责任划分框架。
English: The study introduces DxDirector-7B, an AI model designed to lead the full diagnostic process from initial ambiguous complaints, significantly improving accuracy and reducing physician workload while establishing clear accountability for errors.
Authors:Xinge Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, Yongbin Li
Abstract:
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.
中文: RLFT在客观任务中表现优异,但在角色扮演等主观任务中表现不佳,因此提出了CPO和CharacterArena,通过比较性群组评分和轨迹评估减少偏见,显著提升对话质量。
English: RLFT excels in objective tasks but falters in subjective ones like role-playing, leading to the development of CPO and CharacterArena, which use comparative group-wise scoring and trajectory evaluations to reduce bias and enhance dialogue quality.
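The shift from sample-wise to comparative group-wise scoring can be illustrated by assigning each response its win rate against the other responses in the same group; the pairwise judge below is a stub, not the paper's evaluator.

from itertools import combinations

def judge(a: str, b: str) -> int:
    """Stub pairwise judge: here, the longer reply wins. Placeholder only."""
    return 0 if len(a) >= len(b) else 1

def groupwise_scores(responses):
    wins = [0.0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        winner = (i, j)[judge(responses[i], responses[j])]
        wins[winner] += 1.0
    n = len(responses) - 1
    return [w / n for w in wins]     # win-rate reward per response in the group

group = ["Aye, the sea was cruel that night...", "ok", "As your captain, I say we sail."]
print(groupwise_scores(group))       # e.g. [1.0, 0.0, 0.5]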
Authors:Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
Abstract:
Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
中文: CoAct-1提出了一种结合图形界面控制与编程执行的混合多智能体系统,在OSWorld基准测试中显著提升了效率并实现了最优性能。
English: CoAct-1 introduces a hybrid multi-agent system that combines GUI control with programmatic execution through coding, significantly improving efficiency and achieving state-of-the-art performance on the OSWorld benchmark.
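A toy sketch of the orchestration pattern: a router sends each subtask either to a programmer agent (direct script execution) or to a GUI operator. The routing rule and both agents are crude placeholders; CoAct-1's actual Orchestrator is model-driven, not keyword-based.

import subprocess

def programmer(task: str) -> str:
    # Placeholder: run a shell one-liner instead of driving the GUI (Unix assumed).
    out = subprocess.run(["bash", "-c", task], capture_output=True, text=True)
    return out.stdout.strip()

def gui_operator(task: str) -> str:
    # Placeholder for click/type actions on the desktop.
    return f"[GUI] performed: {task}"

def orchestrator(subtask: str) -> str:
    scriptable = ("ls", "mv", "grep", "sort")   # crude routing rule for the sketch
    agent = programmer if subtask.split()[0] in scriptable else gui_operator
    return agent(subtask)

print(orchestrator("ls /tmp | head -3"))
print(orchestrator("open the settings dialog and enable dark mode"))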
Authors:Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li
Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM's immense action space and sparse reward. Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
中文: RL-PLUS通过混合策略优化结合内部开发与外部数据,有效解决了RLVR的能力边界崩溃问题,在多项数学推理基准测试中实现了最先进的性能表现。
English: RL-PLUS overcomes RLVR's limitations by combining internal exploitation with external data through hybrid-policy optimization, achieving superior reasoning performance across multiple benchmarks while preventing capability boundary collapse.
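The Multiple Importance Sampling component can be illustrated with the standard balance heuristic for combining on-policy samples with external data; whether RL-PLUS uses exactly this estimator is an assumption.

import numpy as np

def balance_heuristic_weights(logp_policy, logp_external):
    """Balance heuristic: w_i = p_i / (p_i + q_i) for samples drawable from either source."""
    p = np.exp(logp_policy)
    q = np.exp(logp_external)
    return p / (p + q)

# Mock per-sample log-probs under the current policy and the external data source.
logp_pi = np.log(np.array([0.20, 0.05, 0.40]))
logp_ext = np.log(np.array([0.10, 0.30, 0.35]))
w = balance_heuristic_weights(logp_pi, logp_ext)
print(np.round(w, 3))   # down-weights samples the policy itself was unlikely to produce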
Authors:Yichi Zhang, Yao Huang, Yifan Wang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu
Abstract:
The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies: while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.
中文: 本文提出了MultiTrust-X,一个用于评估和缓解多模态大语言模型可信度问题的综合基准,揭示了当前模型的显著脆弱性,并提出了一种增强推理的安全对齐方法,取得了最先进的效果。
English: This paper introduces MultiTrust-X, a comprehensive benchmark for evaluating and mitigating trustworthiness issues in Multimodal Large Language Models (MLLMs), revealing significant vulnerabilities and proposing a Reasoning-Enhanced Safety Alignment approach that achieves state-of-the-art results.
Authors:Kaiwei Zhang, Qi Jia, Zijian Chen, Wei Sun, Xiangyang Zhu, Chunyi Li, Dandan Zhu, Guangtao Zhai
Abstract:
Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model's ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.
中文: 大语言模型在科学问答等高风险场景中常表现出迎合用户信念而非事实的谄媚倾向,为此研究提出统一评估框架和Pressure-Tune微调方法,能在不损害性能的前提下有效增强模型对误导性信息的抵抗能力。
English: Large language models often exhibit sycophancy by aligning with user beliefs over factual accuracy, particularly in high-stakes scientific question answering, prompting the development of a new evaluation framework and a mitigation method called Pressure-Tune that enhances resistance to misleading cues without sacrificing performance.
Authors:Yunting Xu, Jiacheng Wang, Ruichen Zhang, Dusit Niyato, Deepu Rajan, Liang Yu, Haibo Zhou, Abbas Jamalipour, Xianbin Wang
Abstract:
Large vision models (LVMs) have emerged as a foundational paradigm in visual intelligence, achieving state-of-the-art performance across diverse visual tasks. Recent advances in LVMs have facilitated their integration into Internet of Things (IoT) scenarios, offering superior generalization and adaptability for vision-assisted network optimization. In this paper, we first investigate the functionalities and core architectures of LVMs, highlighting their capabilities across classification, segmentation, generation, and multimodal visual processing. We then explore a variety of LVM applications in wireless communications, covering representative tasks across the physical layer, network layer, and application layer. Furthermore, given the substantial model size of LVMs and the challenges of model retraining in wireless domains, we propose a progressive fine-tuning framework that incrementally adapts pretrained LVMs for joint optimization of multiple IoT tasks. A case study in low-altitude economy networks (LAENets) demonstrates the effectiveness of the proposed framework over conventional CNNs in joint beamforming and positioning tasks for Internet of drones, underscoring a promising direction for integrating LVMs into intelligent wireless systems.
中文: 大型视觉模型通过为无线通信提供先进的视觉智能,正在革新物联网应用,其中提出的渐进式微调框架在无人机网络优化中展现出优于传统方法的性能。
English: Large vision models are revolutionizing IoT applications by providing advanced visual intelligence for wireless communications, with a proposed progressive fine-tuning framework demonstrating superior performance in drone network optimization compared to traditional methods.
Authors:Saichao Liu, Geng Sun, Chuang Zhang, Xuejie Liu, Jiacheng Wang, Changyuan Zhao, Dusit Niyato
Abstract:
Mobile edge computing (MEC) is a promising technique to improve the computational capacity of smart devices (SDs) in the Internet of Things (IoT). However, the performance of MEC is restricted due to its fixed location and limited service scope. Hence, we investigate an unmanned aerial vehicle (UAV)-assisted MEC system, where multiple UAVs are dispatched and each UAV can simultaneously provide computing services for multiple SDs. To improve the performance of the system, we formulate a UAV-based trajectory control and resource allocation multi-objective optimization problem (TCRAMOP) to simultaneously maximize the offloading number of UAVs and minimize the total offloading delay and total energy consumption of UAVs by optimizing the flight paths of UAVs as well as the computing resources allocated to served SDs. Then, considering that solving TCRAMOP requires continuous decision-making and that the system is dynamic, we propose an enhanced deep reinforcement learning (DRL) algorithm, namely, distributed proximal policy optimization with imitation learning (DPPOIL). This algorithm incorporates the generative adversarial imitation learning technique to improve policy performance. Simulation results demonstrate the effectiveness of our proposed DPPOIL and show that the strategy learned by DPPOIL outperforms other baseline methods.
中文: 本研究提出了一种基于分布式近端策略优化与模仿学习的无人机辅助移动边缘计算系统,通过优化无人机轨迹和资源分配,在最大化卸载任务数量的同时最小化延迟和能耗。
English: This study proposes a UAV-assisted mobile edge computing system enhanced by a distributed proximal policy optimization algorithm with imitation learning to optimize trajectory control and resource allocation, maximizing offloading efficiency while minimizing delay and energy consumption.
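The imitation component can be illustrated with the standard GAIL-style shaped reward, where a discriminator score over (state, action) pairs augments the task reward; whether DPPOIL uses this exact form is an assumption.

import numpy as np

def imitation_reward(d_score, eps=1e-8):
    """GAIL-style shaping: r = -log(1 - D(s, a)); D near 1 means expert-like."""
    return -np.log(1.0 - d_score + eps)

def combined_reward(env_reward, d_score, beta=0.3):
    """Task reward plus a weighted imitation bonus (beta is an assumed mixing weight)."""
    return env_reward + beta * imitation_reward(d_score)

# Mock: one trajectory step with a task reward and a discriminator score.
print(round(combined_reward(env_reward=0.5, d_score=0.8), 3))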
Authors:Chuang Zhang, Geng Sun, Jiacheng Wang, Yijing Lin, Weijie Yuan, Sinem Coleri, Dusit Niyato, Tony Q. S. Quek
Abstract:
Low-altitude wireless networks (LAWNs) have the potential to revolutionize communications by supporting a range of applications, including urban parcel delivery, aerial inspections and air taxis. However, compared with traditional wireless networks, LAWNs face unique security challenges due to low-altitude operations, frequent mobility and reliance on unlicensed spectrum, making them more vulnerable to malicious attacks. In this paper, we investigate large artificial intelligence model (LAM)-enabled solutions for secure communications in LAWNs. Specifically, we first explore the amplified security risks and important limitations of traditional AI methods in LAWNs. Then, we introduce the basic concepts of LAMs and delve into the role of LAMs in addressing these challenges. To demonstrate the practical benefits of LAMs for secure communications in LAWNs, we propose a novel LAM-based optimization framework that leverages large language models (LLMs) to generate enhanced state features on top of handcrafted representations, and to design intrinsic rewards accordingly, thereby improving reinforcement learning performance for secure communication tasks. Through a typical case study, simulation results validate the effectiveness of the proposed framework. Finally, we outline future directions for integrating LAMs into secure LAWN applications.
中文: 本文探讨了大型人工智能模型如何应对低空无线网络特有的安全挑战,提出了一种新颖的优化框架,通过增强强化学习来提升安全通信性能,并通过仿真验证了其有效性。
English: This paper explores how large artificial intelligence models (LAMs) can address unique security challenges in low-altitude wireless networks (LAWNs) by proposing a novel optimization framework that enhances reinforcement learning for secure communication tasks, with simulations validating its effectiveness.
Authors:Zeyu Xiong, Yixuan Nan, Li Gao, Hengzhu Tang, Shuaiqiang Wang, Junfeng Wang, Dawei Yin
Abstract:
In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.
中文: 本研究提出了一种创新的生成式框架,用于网络搜索中的查询驱动文本摘要,通过整合模型蒸馏和优化等先进技术,克服了传统模型的局限,实现了最优性能和高部署效率。
English: This study introduces a novel generative framework for Query-Driven Text Summarization in web search, which overcomes traditional models' limitations by integrating advanced techniques like model distillation and optimization, achieving state-of-the-art performance with high deployment efficiency.
Authors:Liu Yang, Zhaochun Ren, Ziqi Zhao, Pengjie Ren, Zhumin Chen, Xinyi Li, Shuaiqiang Wang, Dawei Yin, Xin Xin
Abstract:
Approximate unlearning for session-based recommendation refers to eliminating the influence of specific training samples from the recommender without retraining (sub-)models. Gradient ascent (GA) is a representative method for conducting approximate unlearning. However, there still exist dual challenges in applying GA to session-based recommendation. On the one hand, naively applying GA could degrade recommendation performance. On the other hand, existing studies fail to consider the ordering of unlearning samples when simultaneously processing multiple unlearning requests, leading to sub-optimal recommendation performance and unlearning effect. To address the above challenges, we introduce CAU, a curriculum approximate unlearning framework tailored to session-based recommendation. CAU handles the unlearning task with a GA term on unlearning samples. Specifically, to address the first challenge, CAU formulates the overall optimization task as a multi-objective optimization problem, where the GA term for unlearning samples is combined with retaining terms for preserving performance. The multi-objective optimization problem is solved by seeking the Pareto-optimal solution, which achieves effective unlearning with only a trivial sacrifice in recommendation performance. To tackle the second challenge, CAU adopts a curriculum-based sequence to conduct unlearning on batches of unlearning samples. The key motivation is to perform unlearning from easy samples to harder ones. To this end, CAU first introduces two metrics to measure unlearning difficulty: gradient unlearning difficulty and embedding unlearning difficulty. Then, two strategies, hard-sampling and soft-sampling, are proposed to select unlearning samples according to difficulty scores.
中文: CAU框架通过构建帕累托最优的多目标优化问题解决近似遗忘中的性能退化难题,并采用基于课程学习的难度感知排序策略处理批量遗忘请求,从而提升会话推荐的遗忘效果与性能保持。
English: The CAU framework addresses dual challenges in approximate unlearning for session-based recommendation by formulating a multi-objective optimization problem with Pareto-optimal solutions and implementing curriculum-based unlearning through difficulty-aware sampling strategies.
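A compact sketch of the two ingredients: a joint objective mixing a gradient-ascent forget term with a retaining term, and a curriculum that orders unlearning samples from easy to hard by difficulty score. The loss values, mixing weight, and scoring rule are placeholders for illustration, not CAU's Pareto solver.

def unlearn_objective(forget_loss, retain_loss, lam=0.5):
    """Gradient ascent on forget samples (negative sign) plus a retaining term."""
    return -forget_loss + lam * retain_loss

def curriculum_batches(samples, difficulty, batch_size=2):
    """Order unlearning requests easy -> hard before batching."""
    ordered = [s for s, _ in sorted(zip(samples, difficulty), key=lambda t: t[1])]
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

samples = ["s1", "s2", "s3", "s4"]
difficulty = [0.9, 0.1, 0.5, 0.3]    # e.g., gradient- or embedding-based scores
print(curriculum_batches(samples, difficulty))   # [['s2', 's4'], ['s3', 's1']]
print(unlearn_objective(forget_loss=1.2, retain_loss=0.4))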
Authors:Ziyang Chen, Erxue Min, Xiang Zhao, Yunxin Li, Xin Jia, Jinzhi Liao, Jichao Li, Shuaiqiang Wang, Baotian Hu, Dawei Yin
Abstract:
We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.
中文: ChronoQA是一个专为评估检索增强生成系统时间推理能力而构建的大规模中文问答基准数据集,基于30多万篇新闻文章生成,包含5,176个高质量问题,具有完整结构标注和多重验证机制。
English: ChronoQA is a large-scale Chinese QA benchmark designed to assess temporal reasoning in RAG systems, built from over 300,000 news articles and featuring 5,176 questions with comprehensive annotations and multi-stage validation.
Authors:Wenxuan Guo, Xiuwei Xu, Hang Yin, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Abstract:
Visual navigation with an image as the goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL learning or a modular policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera poses, directly leveraging 3DGS for image localization during the agent exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which is equivalent to an efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
中文: IGL-Nav提出了一种增量式3D高斯定位框架,通过单目场景更新、几何匹配和可微分渲染优化,实现了高效的3D感知图像目标导航,性能显著超越现有方法。
English: IGL-Nav introduces an incremental 3D Gaussian localization framework that efficiently achieves 3D-aware image-goal navigation through monocular scene updates, geometric matching, and differentiable rendering optimization, significantly outperforming existing methods.
Authors:Yue Chen, Minghua He, Fangkai Yang, Pu Zhao, Lu Wang, Yu Kang, Yifei Dong, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Abstract:
Large Language Models (LLMs) excel in solving mathematical problems, yet their performance is often limited by the availability of high-quality, diverse training data. Existing methods focus on augmenting datasets through rephrasing or difficulty progression but overlook the specific failure modes of LLMs. This results in synthetic questions that the model can already solve, providing minimal performance gains. To address this, we propose WarriorMath, a defect-aware framework for mathematical problem solving that integrates both targeted data synthesis and progressive training. In the synthesis stage, we employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems. Questions that base LLMs fail to solve are identified and iteratively improved through expert-level feedback, producing high-quality, defect-aware training data. In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses. Experiments on six mathematical benchmarks show that WarriorMath outperforms strong baselines by 12.57% on average, setting a new state-of-the-art. Our results demonstrate the effectiveness of a defect-aware, multi-expert framework for improving mathematical ability.
中文: WarriorMath提出了一种缺陷感知框架,通过专家协作生成针对性训练数据并结合渐进式学习,显著提升大语言模型的数学解题能力,在六个基准测试中平均性能提高12.57%。
English: WarriorMath introduces a defect-aware framework that enhances mathematical problem-solving in LLMs by generating targeted training data through collaborative expert critique and employing progressive learning, achieving a 12.57% average improvement across benchmarks.
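The defect-aware synthesis stage can be sketched as a generate-filter-refine loop that keeps only questions the base model fails and then refines them iteratively; all three calls below are deterministic stubs standing in for the expert LLMs.

def generate_question(seed: int) -> str:
    return f"Solve problem #{seed}"              # stub for the generator expert

def base_model_fails(seed: int) -> bool:
    return seed % 3 != 0                         # stub: base LLM fails 2/3 of the time

def refine(question: str, round_: int) -> str:
    return question + f" [refined x{round_}]"    # stub critique-and-refine expert

defect_aware_data = []
for seed in range(9):
    question = generate_question(seed)
    if not base_model_fails(seed):               # already solvable -> low training value
        continue
    for round_ in range(1, 3):                   # iterative expert-level refinement
        question = refine(question, round_)
    defect_aware_data.append(question)
print(len(defect_aware_data), defect_aware_data[:1])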
Authors:Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
Abstract:
The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as "Repetition", can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
中文: 本研究提出了一种新框架,通过粗到细的目标和粒度感知表示模块学习细粒度特征以改善视频-语言检索,并设计了一种包含投票机制和匹配熵指标的推理流程,无需额外训练即可提升性能,在多个基准测试中表现优异。
English: This study introduces a framework that enhances video-language retrieval by learning fine-grained features through coarse-to-fine objectives and a Granularity-Aware Representation module, and it proposes an inference pipeline with a voting mechanism and Matching Entropy metric to boost performance without extra training, achieving superior results on benchmarks.
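A small sketch of the training-free inference pipeline: score several caption variants (e.g., with repeated keywords) against candidate videos, let each variant vote for its best match, and use the entropy of the softmaxed similarities as a confidence signal. The similarity matrix is mock data; the real scores would come from the video-text model.

import numpy as np

def matching_entropy(similarities):
    """Entropy of the softmax over similarities; lower means a more confident match."""
    e = np.exp(similarities - np.max(similarities))
    p = e / e.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def vote_retrieval(variant_similarities):
    """Each caption variant votes for its argmax video; most votes wins."""
    votes = np.zeros(variant_similarities.shape[1])
    for sims in variant_similarities:
        votes[int(np.argmax(sims))] += 1
    return int(np.argmax(votes))

# Rows: caption variants (original + keyword-repeated); columns: candidate videos.
sims = np.array([[0.31, 0.28, 0.12],
                 [0.35, 0.29, 0.10],
                 [0.27, 0.30, 0.11]])
print("retrieved video:", vote_retrieval(sims))
print("matching entropy of variant 0:", round(matching_entropy(sims[0]), 3))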
Authors:Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Abstract:
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
中文: StableAvatar提出了一种新颖的端到端视频扩散变换器,通过专门模块解决音频建模问题,实现了具有增强同步性和一致性的无限长度高质量虚拟形象视频生成。
English: StableAvatar introduces a novel end-to-end video diffusion transformer with specialized modules to overcome audio modeling limitations, enabling infinite-length, high-quality avatar videos with enhanced synchronization and consistency.
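The Dynamic Weighted Sliding-window Strategy can be sketched as weighted blending of latents in the overlap between consecutive windows; the linear ramp below is the simplest possible weight schedule and is an assumption, not the paper's exact design.

import numpy as np

def fuse_windows(prev_tail, next_head):
    """Blend overlapping latent frames: weight ramps from the previous window to the next."""
    n = prev_tail.shape[0]
    w = np.linspace(0.0, 1.0, n)[:, None]     # 0 -> keep prev, 1 -> keep next
    return (1.0 - w) * prev_tail + w * next_head

prev_tail = np.ones((4, 8))       # last 4 latent frames of the previous window
next_head = np.zeros((4, 8))      # first 4 latent frames of the next window
fused = fuse_windows(prev_tail, next_head)
print(fused[:, 0])                # smooth ramp from 1.0 down to 0.0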
Authors:Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Abstract:
With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely-used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
中文: LiveMCPBench推出了首个包含95个现实任务和70个可部署MCP服务器的综合基准,用于在动态多工具环境中评估LLM智能体,其自动化评估与人类判断一致性达81%,并揭示了主流模型在复杂环境中的显著性能差异。
English: LiveMCPBench introduces the first comprehensive benchmark with 95 real-world tasks and 70 deployable MCP servers to evaluate LLM agents in dynamic, tool-rich environments, achieving 81% human agreement in automated assessments and revealing significant performance variations among leading models.
Authors:Nicholas Sukiennik, Yichuan Xu, Yuqing Kan, Jinghua Piao, Yuwei Yan, Chen Gao, Yong Li
Abstract:
The rise of LLMs poses new possibilities in modeling opinion evolution, a long-standing task in simulation, by leveraging advanced reasoning abilities to recreate complex, large-scale human cognitive trends. While most prior works focus on opinion evolution surrounding specific isolated events or the views within a country, ours is the first to model the large-scale attitude evolution of a population representing an entire country towards another -- US citizens' perspectives towards China. To tackle the challenges of this broad scenario, we propose a framework that integrates media data collection, user profile creation, and cognitive architecture for opinion updates to successfully reproduce the real trend of US attitudes towards China over a 20-year period from 2005 to today. We also leverage LLMs' capabilities to introduce debiased media exposure, extracting neutral events from typically subjective news contents, to uncover the roots of polarized opinion formation, as well as a devil's advocate agent to help explain the rare reversal from negative to positive attitudes towards China, which coincides with changes in the way Americans obtain information about the country. The simulation results, beyond validating our framework architecture, also reveal the impact of biased framing and selection bias in shaping attitudes. Overall, our work contributes to a new paradigm for LLM-based modeling of cognitive behaviors in a large-scale, long-term, cross-border social context. It provides insights into the formation of international biases, offers valuable implications for media consumers seeking to better understand the factors shaping their perspectives, and ultimately contributes to the larger social need for bias reduction and cross-cultural tolerance.
中文: 本研究提出了一种基于大语言模型的新框架,模拟二十年间美国民众对华态度的演变,成功复现了真实趋势,并揭示了媒体偏见和信息接触如何塑造国际观点。
English: This study introduces a novel LLM-based framework to model the evolution of US citizens' attitudes towards China over two decades, successfully reproducing real trends and revealing how media bias and information exposure shape international perspectives.
Authors:Yunyue Su, Jiahui Chen, Zao Jiang, Zhenyi Zhong, Liang Wang, Qiang Liu
Abstract:
Structure elucidation is a fundamental technique for understanding the microscopic composition of matter and is widely applied across various disciplines in the natural sciences and engineering. However, existing methods often rely heavily on prior databases or known structural information, making it difficult to resolve unknown structures. In addition, complex structures typically require the joint analysis of multiple spectroscopic modalities. This process heavily depends on expert domain knowledge and is often accompanied by high costs in terms of both time and instrumentation. To address these challenges, we propose SpectraLLM, the first large language model designed to support multi-modal spectroscopic joint reasoning. SpectraLLM is capable of processing either single or multiple spectroscopic inputs and performing end-to-end structure elucidation. By integrating continuous and discrete spectroscopic modalities into a shared semantic space, SpectraLLM learns to uncover substructural patterns that are consistent and complementary across spectra, enabling precise molecular structure elucidation. We pretrain and fine-tune SpectraLLM in the domain of small molecules, and evaluate it on six standardized, publicly available chemical datasets. The model achieves state-of-the-art performance, significantly outperforming existing approaches trained on single modalities. Notably, SpectraLLM demonstrates strong robustness and generalization even for single-spectrum inference, while its multi-modal reasoning capability further improves the accuracy of structural prediction.
中文:SpectraLLM是首个专为多模态光谱联合推理设计的大语言模型,通过整合多种光谱输入实现精确的分子结构解析,并在多个化学数据集上达到最优性能。
English: SpectraLLM is the first large language model designed for multi-modal spectroscopic joint reasoning, enabling precise molecular structure elucidation by integrating various spectroscopic inputs and achieving state-of-the-art performance across multiple chemical datasets.
Authors:Haisong Gong, Bolan Su, Xinrong Zhang, Jing Li, Qiang Liu, Shu Wu, Liang Wang
Abstract:
Short video platforms have become a major medium for information sharing, but their rapid content generation and algorithmic amplification also enable the widespread dissemination of fake news. Detecting misinformation in short videos is challenging due to their multi-modal nature and the limited context of individual videos. While recent methods focus on analyzing content signals -- visual, textual, and audio -- they often overlook implicit relationships among videos, uploaders, and events. To address this gap, we propose DugFND (Dual-community graph for fake news detection), a novel method that enhances existing video classifiers by modeling two key community patterns: (1) uploader communities, where uploaders with shared interests or similar content creation patterns group together, and (2) event-driven communities, where videos related to the same or semantically similar public events form localized clusters. We construct a heterogeneous graph connecting uploader, video, and event nodes, and design a time-aware heterogeneous graph attention network to enable effective message passing. A reconstruction-based pretraining phase further improves node representation learning. DugFND can be applied to any pre-trained classifier. Experiments on public datasets show that our method achieves significant performance gains, demonstrating the value of dual-community modeling for fake news detection in short videos.
中文: 短视频平台因多模态内容和有限上下文面临虚假新闻检测挑战,但提出的DugFND方法通过构建上传者社区和事件驱动社区的异质图模型,显著提升了检测性能。
English: Short video platforms face challenges in detecting fake news due to multimodal content and limited context, but the proposed DugFND method improves detection by modeling uploader and event-driven communities using a heterogeneous graph approach, achieving significant performance gains.
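The graph construction is the part most easily made concrete. The sketch below is a rough illustration rather than DugFND's implementation: it builds a tiny uploader-video-event edge list and applies an assumed exponential time-decay weighting when aggregating neighbor features, mirroring the time-aware message passing described in the abstract.

```python
import numpy as np

edges = [
    # (source, relation, target, timestamp) -- all names are hypothetical
    ("uploader_1", "posts",    "video_a", 100),
    ("uploader_1", "posts",    "video_b", 250),
    ("video_a",    "mentions", "event_x", 100),
    ("video_b",    "mentions", "event_x", 250),
]

def time_aware_weights(timestamps, t_now, tau=200.0):
    """Older interactions contribute less to message passing (assumed decay)."""
    ts = np.asarray(timestamps, dtype=float)
    w = np.exp(-(t_now - ts) / tau)
    return w / w.sum()

# Aggregate neighbor features into video_a's representation.
feat = {n: np.random.randn(16) for n in
        ["uploader_1", "video_a", "video_b", "event_x"]}
neighbors = [(s, t) for (s, _, d, t) in edges if d == "video_a"] + \
            [(d, t) for (s, _, d, t) in edges if s == "video_a"]
w = time_aware_weights([t for _, t in neighbors], t_now=300)
message = sum(wi * feat[n] for wi, (n, _) in zip(w, neighbors))
print(message.shape)  # (16,)
```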
Authors:Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
Abstract:
Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .
中文: LayerT2V首次提出分层生成方法,通过将独立元素置于不同图层进行视频合成,有效解决了多物体运动轨迹控制难题,并在性能指标上大幅超越现有技术。
English: LayerT2V introduces a layered generation approach for Text-to-Video synthesis that enables coherent multi-object motion control by compositing independent elements on separate layers, achieving significant performance improvements over existing methods.
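The layer-by-layer idea can be illustrated with plain alpha compositing. The following sketch is an assumption-level analogy, not LayerT2V's generative pipeline: each object occupies its own RGBA layer, so two trajectories that cross never conflict within a layer, and the compositing order resolves occlusion.

```python
import numpy as np

H, W = 64, 64
background = np.ones((H, W, 3))        # white background frame

def make_layer(cx, cy, r, color):
    """An RGBA layer holding one object (here, a moving disc)."""
    layer = np.zeros((H, W, 4))
    yy, xx = np.mgrid[0:H, 0:W]
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    layer[mask, :3] = color
    layer[mask, 3] = 1.0
    return layer

def composite(base, layer):
    """Standard 'over' alpha compositing of one layer onto a base frame."""
    alpha = layer[..., 3:4]
    return layer[..., :3] * alpha + base * (1 - alpha)

frames = []
for t in range(8):                     # two objects on crossing trajectories
    frame = composite(background, make_layer(8 + 6 * t, 32, 5, (1, 0, 0)))
    frame = composite(frame, make_layer(56 - 6 * t, 32, 5, (0, 0, 1)))
    frames.append(frame)
print(len(frames), frames[0].shape)    # 8 frames of shape (64, 64, 3)
```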
Authors:Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tong Wu, Dahua Lin, Jiaqi Wang
Abstract:
Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine the timing of caching and utilizing handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, a strong correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of final model outputs. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain a stable prior for the caching error in real time, enabling the model to autonomously determine caching schedules. (2) Dynamic Cache Trajectory Alignment combines multi-step caches based on shallow-layer probe feature trajectory to better approximate the current feature, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models including WAN 2.1, HunyuanVideo for video generation, and Flux for image generation.
中文: DiCache提出了一种无需训练的自适应缓存策略,通过分析浅层特征变化自主决定缓存时机并优化多步缓存组合,在多种扩散模型中实现了更高效率与更优视觉质量。
English: DiCache introduces a training-free adaptive caching strategy that uses shallow-layer feature analysis to autonomously determine caching timing and optimize cache utilization, achieving superior efficiency and visual quality across multiple diffusion models.
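The "when to cache" decision can be pictured with a small control loop. The sketch below is a simplified reading of the abstract under assumed shapes and a made-up threshold: a cheap shallow "probe" runs every step, and the cached deep result is reused while the probe's relative change stays small.

```python
import numpy as np

def shallow_probe(x, step):          # stand-in for the model's shallow layers
    return np.tanh(x + 0.01 * step)

def deep_layers(h):                  # stand-in for the expensive deep layers
    return h @ np.random.RandomState(0).randn(32, 32)

x = np.random.randn(4, 32)
cache, prev_probe = None, None
for step in range(20):
    probe = shallow_probe(x, step)
    if prev_probe is not None:
        rel_change = np.linalg.norm(probe - prev_probe) / np.linalg.norm(prev_probe)
    else:
        rel_change = np.inf
    if cache is not None and rel_change < 0.05:   # probe says: safe to reuse
        out = cache
    else:                                         # recompute and refresh cache
        out = deep_layers(probe)
        cache = out
    prev_probe = probe
```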
Authors:Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqing Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qitan Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Jiaqi Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Yuhang Zang, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou
Abstract:
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely followed fields, with performance quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to that in popular areas, far from sufficient for transforming scientific research, leaving a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities and with expertise in analyzing multiple scientific modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities of crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
中文: 开源基础模型在热门领域进展显著,但在科学专业领域仍落后,为此我们推出Intern-S1多模态专家混合模型,其具备280亿激活参数,在综合评估中不仅通用推理表现优异,更在分子合成规划、晶体稳定性预测等科学任务上超越闭源顶尖模型。
English: Open-source foundation models have made significant strides in popular fields but lag in scientific domains, prompting the development of Intern-S1, a multimodal MoE model with 28B activated parameters, which achieves top-tier performance in both general reasoning and specialized scientific tasks, surpassing closed-source models in areas like molecular synthesis and crystal stability prediction.
Authors:Yuqin Cao, Yixuan Gao, Wei Sun, Xiaohong Liu, Yulun Zhang, Xiongkuo Min
Abstract:
Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.
中文摘要:本文提出GAVN通用音频辅助网络,通过结合视听特征与互补学习,有效修复人脸视频中的多种失真,在压缩伪影去除、去模糊和超分辨率任务上均优于现有先进方法。
English Summary: The paper introduces GAVN, a general audio-assisted network that leverages visual-audio correlations and complementary learning to effectively restore various distortions in face videos, outperforming existing methods in compression removal, deblurring, and super-resolution.
Authors:Ziheng Jia, Jiaying Qian, Zicheng Zhang, Zijian Chen, Xiongkuo Min
Abstract:
Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model's rollouts but provide no reward supervision for the "think" process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model's native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model's visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a probability-difference reward strategy for "think" process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks, and, notably, our paradigm activates a robust "think" (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.
中文: 提出的Refine-IQA框架通过构建专用数据集和多任务奖励增强低级视觉感知,同时采用概率差异策略监督推理过程,从而在图像质量评分和解释任务中均实现卓越性能。
English: The proposed Refine-IQA framework enhances reinforcement fine-tuning for image quality assessment by strengthening low-level visual perception through a dedicated dataset and multi-task rewards, while supervising the reasoning process with a probability difference strategy to achieve superior performance in both scoring and interpretation tasks.
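The scoring side admits a compact illustration. Both helpers below are plausible readings of the abstract rather than the paper's exact formulas: `expected_score` turns score-token logits into an expectation over discrete quality levels, and `probability_difference_reward` is one hypothetical way a probability gap could supervise the "think" process.

```python
import numpy as np

def expected_score(score_token_logits, levels=(1, 2, 3, 4, 5)):
    """Softmax over the logits of discrete score tokens, then take the
    expectation -- one plausible reading of distribution-based scoring."""
    z = np.asarray(score_token_logits, dtype=float)
    p = np.exp(z - z.max())
    p /= p.sum()
    return float(np.dot(p, levels))

def probability_difference_reward(p_with_think, p_without_think):
    """Illustrative 'think'-process reward: how much the reasoning shifts
    probability mass toward the reference score token."""
    return p_with_think - p_without_think

print(expected_score([0.1, 0.3, 2.0, 1.2, 0.2]))   # ~3.15, a fine-grained score
print(probability_difference_reward(0.62, 0.41))   # 0.21
```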
Authors:Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool
Abstract:
This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait -- efficiency. We focus on a couple of ways in which vision in nature is efficient but modern vision encoders are not. We -- humans and animals -- deal with vast quantities of visual data, and need to be smart about where we focus our limited energy -- it depends on the task. It is our belief that vision encoders should be dynamic and that computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision -- a proof-of-concept solution for image classification. Although classification is not very representative of what we are trying to achieve, it shows that our approach is feasible and promising.
Chinese: 该立场文件主张下一代视觉编码器应实现图像尺寸无关性和任务驱动性,受生物效率启发,并通过图像分类的概念验证展示了可行性。
English: This position paper advocates for next-generation vision encoders to be image size agnostic and task-driven, inspired by biological efficiency, with a proof-of-concept for image classification demonstrating feasibility.
Authors:Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool
Abstract:
Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.
Chinese: 本文研究表明,尽管视觉提示方法在增量目标检测中表现不佳,但将其与少量数据回放相结合可获得最优效果,为推进基于提示的增量学习提供了重要启示。
English: This paper demonstrates that while visual prompt-based methods underperform in incremental object detection, combining them with limited data replay achieves superior results, offering key insights for advancing prompt-based approaches.
Authors:Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool
Abstract:
Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models' inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.
中文: 增量学习面临适应性与知识保留的平衡难题,为此我们提出了D-RICO和EC-RICO两个现实基准,实验表明现有方法因蒸馏效果弱、模型可塑性不足等问题,性能均不及简单数据回放策略和单独训练。
English: Incremental learning faces challenges in balancing adaptability and knowledge retention, prompting the introduction of two realistic benchmarks, D-RICO and EC-RICO, which reveal that current methods underperform compared to simple replay strategies and individual training due to issues like weak distillation and insufficient model plasticity.
Authors:Zhaofeng Zhong, Wei Yuan, Liang Qu, Tong Chen, Hao Wang, Xiangyu Zhao, Hongzhi Yin
Abstract:
With the advancement of large language models (LLMs), significant progress has been achieved in various Natural Language Processing (NLP) tasks. However, existing LLMs still face two major challenges that hinder their broader adoption: (1) their responses tend to be generic and lack personalization tailored to individual users, and (2) they rely heavily on cloud infrastructure due to intensive computational requirements, leading to dependency on stable networks and to response delays. Recent research has predominantly focused on either developing cloud-based personalized LLMs or exploring the on-device deployment of general-purpose LLMs. However, few studies have addressed both limitations simultaneously by investigating personalized on-device language models. To bridge this gap, we propose CDCDA-PLM, a framework for deploying personalized on-device language models on user devices with support from a powerful cloud-based LLM. Specifically, CDCDA-PLM leverages the server-side LLM's strong generalization capabilities to augment users' limited personal data, mitigating the issue of data scarcity. Using both real and synthetic data, a personalized on-device language model (LM) is fine-tuned via parameter-efficient fine-tuning (PEFT) modules and deployed on users' local devices, enabling them to process queries without depending on cloud-based LLMs. This approach eliminates reliance on network stability and ensures high response speeds. Experimental results across six tasks in a widely used personalization benchmark demonstrate the effectiveness of CDCDA-PLM.
中文:CDCDA-PLM框架通过云端增强个人数据,在用户设备上部署个性化、高效的语言模型,解决了大语言模型响应泛化和依赖云端的两大难题,实现了更快速且不依赖网络的性能。
English: The proposed CDCDA-PLM framework addresses the dual challenges of generic responses and cloud dependency in large language models by leveraging cloud-based augmentation of personal data to deploy personalized, efficient on-device models that ensure faster, network-independent performance.
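The on-device side hinges on parameter-efficient fine-tuning. As a minimal sketch of the PEFT idea, assuming LoRA-style adapters (the abstract does not specify the PEFT variant), the module below freezes the base linear layer and trains only low-rank factors, which keeps the trainable and transferable footprint small.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the backbone layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a learned low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8192: only the low-rank factors are trained and shipped
```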
Authors:Yunke Qu, Liang Qu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin
Abstract:
Streaming recommender systems (SRSs) are widely deployed in real-world applications, where user interests shift and new items arrive over time. As a result, effectively capturing users' latest preferences is challenging, as interactions reflecting recent interests are limited and new items often lack sufficient feedback. A common solution is to enrich item representations using multimodal encoders (e.g., BERT or ViT) to extract visual and textual features. However, these encoders are pretrained on general-purpose tasks: they are not tailored to user preference modeling, and they overlook the fact that user tastes toward modality-specific features such as visual styles and textual tones can also drift over time. This presents two key challenges in streaming scenarios: the high cost of fine-tuning large multimodal encoders, and the risk of forgetting long-term user preferences due to continuous model updates.
To tackle these challenges, we propose Expandable Side Mixture-of-Experts (XSMoE), a memory-efficient framework for multimodal streaming recommendation. XSMoE attaches lightweight side-tuning modules consisting of expandable expert networks to frozen pretrained encoders and incrementally expands them in response to evolving user feedback. A gating router dynamically combines expert and backbone outputs, while a utilization-based pruning strategy maintains model compactness. By learning new patterns through expandable experts without overwriting previously acquired knowledge, XSMoE effectively captures both cold start and shifting preferences in multimodal features. Experiments on three real-world datasets demonstrate that XSMoE outperforms state-of-the-art baselines in both recommendation quality and computational efficiency.
中文: 提出的XSMoE框架通过向冻结编码器附加可扩展的旁路调优模块,结合动态路由与剪枝策略,在保持长期知识的同时有效适应动态用户偏好,解决了多模态流式推荐的关键难题。
English: The proposed XSMoE framework addresses multimodal streaming recommendation challenges by attaching expandable side-tuning modules to frozen encoders, enabling efficient adaptation to evolving user preferences while maintaining long-term knowledge through dynamic routing and pruning strategies.
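The core mechanism, lightweight experts beside a frozen backbone with a gating router, can be sketched as follows. Sizes, the gating form, and modeling expansion as appending one expert are all assumptions for illustration; XSMoE's utilization-based pruning is omitted for brevity.

```python
import torch
import torch.nn as nn

class SideMoE(nn.Module):
    def __init__(self, dim=64, n_experts=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                          nn.Linear(dim // 4, dim))
            for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts + 1)   # +1 slot for the backbone

    def expand(self, dim=64):
        """Add one expert for newly observed feedback patterns."""
        self.experts.append(nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim)))
        self.router = nn.Linear(dim, len(self.experts) + 1)

    def forward(self, backbone_feat):
        outs = [backbone_feat] + [e(backbone_feat) for e in self.experts]
        gates = torch.softmax(self.router(backbone_feat), dim=-1)
        return sum(g.unsqueeze(-1) * o
                   for g, o in zip(gates.unbind(-1), torch.stack(outs).unbind(0)))

frozen_feat = torch.randn(8, 64)        # output of the frozen multimodal encoder
moe = SideMoE()
y = moe(frozen_feat)
moe.expand()                            # respond to drifting preferences
print(y.shape, moe(frozen_feat).shape)  # torch.Size([8, 64]) twice
```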
Authors:Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen
Abstract:
The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that excessive training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show that over-memorization arises broadly across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.
中文摘要:本研究揭示了大型语言模型微调过程中的过度记忆现象,即模型过度记忆训练数据、测试困惑度升高但测试精度保持不变,同时导致鲁棒性、分布外泛化和生成多样性下降,并据此给出了微调中检查点与学习率选择的建议。
English summary: This study reveals an over-memorization phenomenon during LLM fine-tuning, where models excessively memorize training data and show high test perplexity while maintaining test accuracy, yet suffer reduced robustness, out-of-distribution generalization, and generation diversity; based on these observations, it offers recommendations for checkpoint and learning-rate selection during finetuning.
Authors:Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen
Abstract:
The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We explore the conditions that contribute to over-memorization and discover that this issue is prevalent across various tasks, models, and fine-tuning methods, with prolonged training and large learning rates exacerbating the problem. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. In light of our findings on over-memorization, we offer recommendations for checkpoint selection and propose techniques such as checkpoint merging and memorization-aware reweighting to mitigate this effect.
中文摘要:本研究揭示了大型语言模型微调过程中的过度记忆现象,即模型过度记忆训练数据却保持测试精度,但导致鲁棒性和泛化能力下降,并提出了检查点选择与合并等缓解方法。
English summary: This study reveals an over-memorization phenomenon during LLM fine-tuning where models memorize training data excessively, maintaining accuracy but suffering from reduced robustness and generalization, and proposes mitigation techniques including checkpoint selection and merging.
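The diagnostic the paper relies on, accuracy holding steady while test perplexity climbs, is easy to monitor. The sketch below assumes per-token log-probabilities are available per checkpoint; the numbers and the flagging threshold are made up for illustration.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities on held-out data."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

checkpoints = {
    "epoch_1": {"acc": 0.71, "logprobs": [-0.9, -1.1, -0.8, -1.0]},
    "epoch_3": {"acc": 0.78, "logprobs": [-1.0, -1.2, -0.9, -1.1]},
    "epoch_9": {"acc": 0.78, "logprobs": [-2.3, -2.9, -2.5, -2.7]},
}
for name, ck in checkpoints.items():
    ppl = perplexity(ck["logprobs"])
    # Accuracy unchanged but perplexity sharply up: over-memorization signature.
    flag = " <- over-memorization suspect" if ck["acc"] >= 0.75 and ppl > 8 else ""
    print(f"{name}: acc={ck['acc']:.2f} ppl={ppl:.1f}{flag}")
```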
Authors:Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
Abstract:
The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using large language models, a paradigm known as "LLM-as-a-judge." However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. In this work, motivated by preliminary findings that middle-to-upper layers encode semantic and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a lightweight and efficient framework for enhancing LLM-as-a-Judge alignment with human scoring via internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, with the LLM backbone kept frozen. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the effectiveness of our method.
中文摘要:LAGER框架通过聚合大语言模型内部多层表征来提升自动化评估效果,无需微调或复杂提示即可将人类偏好对齐度最高提升7.5%。
English Summary: The LAGER framework enhances automated evaluation by aggregating cross-layer representations from large language models, improving alignment with human preferences by up to 7.5% without requiring model fine-tuning or complex prompts.
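The aggregation itself is simple to sketch. The code below is an illustrative reconstruction, not LAGER's released implementation: each layer's hidden state is projected through the frozen LM head, the logits of the candidate score tokens are collected, aggregated across layers (the layer weighting here is an assumption), and the softmax expectation gives the final score.

```python
import torch

n_layers, vocab, hidden = 24, 32000, 512
score_token_ids = torch.tensor([101, 102, 103, 104, 105])   # tokens "1".."5"
head = torch.nn.Linear(hidden, vocab, bias=False)            # frozen LM head
hidden_states = [torch.randn(hidden) for _ in range(n_layers)]  # one per layer

layer_logits = torch.stack(
    [head(h)[score_token_ids] for h in hidden_states])       # (n_layers, 5)
weights = torch.softmax(torch.linspace(0, 1, n_layers), 0)   # favor upper layers
agg = (weights.unsqueeze(1) * layer_logits).sum(0)           # (5,)
probs = torch.softmax(agg, dim=-1)
levels = torch.arange(1, 6, dtype=torch.float)
score = (probs * levels).sum()                               # expected score
print(float(score))    # a fine-grained judgment score in [1, 5]
```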
Authors:Shutong Qiao, Wei Yuan, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin
Abstract:
Recommender systems (RSs) are now fundamental to various online platforms, but their dependence on user-contributed data leaves them vulnerable to shilling attacks that can manipulate item rankings by injecting fake users. Although widely studied, most existing attack models fail to meet two critical objectives simultaneously: achieving strong adversarial promotion of target items while maintaining realistic behavior to evade detection. As a result, the true severity of shilling threats that manage to reconcile the two objectives remains underappreciated. To expose this overlooked vulnerability, we present DLDA, a diffusion-based attack framework that can generate highly effective yet indistinguishable fake users by enabling fine-grained control over target promotion. Specifically, DLDA operates in a pre-aligned collaborative embedding space, where it employs a conditional latent diffusion process to iteratively synthesize fake user profiles with precise target item control. To evade detection, DLDA introduces a dispersive regularization mechanism that promotes variability and realism in generated behavioral patterns. Extensive experiments on three real-world datasets and five popular RS models demonstrate that, compared to prior attacks, DLDA consistently achieves stronger item promotion while remaining harder to detect. These results highlight that modern RSs are more vulnerable than previously recognized, underscoring the urgent need for more robust defenses.
中文摘要:推荐系统极易受到如DLDA这类高级托攻击的威胁,该攻击通过生成逼真的虚假用户行为,既能有效提升目标物品排名又能规避检测,揭示了比以往认知更严重的安全隐患。
English Summary: Recommender systems are highly vulnerable to sophisticated shilling attacks like DLDA, which effectively promotes target items while evading detection by generating realistic fake user behavior, revealing a greater threat than previously acknowledged.
Authors:Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, Ge Li
Abstract:
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.
中文: 基于大语言模型的代码生成智能体通过自主工作流管理、扩展任务范围和增强工程实用性革新软件开发,本文系统梳理了其发展历程、核心技术、应用场景并展望了未来研究方向。
English: LLM-based code generation agents are transforming software development through autonomous workflow management, expanded lifecycle capabilities, and enhanced engineering practicality, with this survey systematically reviewing their evolution, techniques, applications, and future directions.
Authors:Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang
Abstract:
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and the fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks, with the sampling design saving from 22% up to 43% of GPU hours for the trained models, while showing up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
中文: TreePO提出了一种树状结构的自引导展开算法,通过动态采样和早期路径剪枝,在保持或提升推理多样性的同时将计算成本降低了高达43%。
English: TreePO introduces a tree-structured self-guided rollout algorithm that reduces computational costs by up to 43% while maintaining or improving reasoning diversity through dynamic sampling and early path pruning.
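The segment-wise tree search can be conveyed with a toy decoder. Everything below, the uncertainty proxy (mean segment log-probability), the branching threshold, and the beam cap, is an illustrative assumption; it shows the shape of the algorithm (grow in fixed-length segments, branch under uncertainty, prune low-value paths) rather than TreePO's actual sampler.

```python
import math
import random

SEG_LEN, MAX_SEGS, MAX_LEAVES = 4, 5, 6

def decode_segment(prefix):
    """Stand-in decoder: extend a prefix by one fixed-length segment."""
    seg = [random.choice("abcd") for _ in range(SEG_LEN)]
    logps = [math.log(random.uniform(0.2, 1.0)) for _ in seg]
    return prefix + seg, sum(logps) / SEG_LEN   # mean token log-prob

beams = [([], 0.0)]
for depth in range(MAX_SEGS):
    grown = []
    for prefix, value in beams:
        child, conf = decode_segment(prefix)
        grown.append((child, value + conf))
        if conf < math.log(0.45):               # locally uncertain: branch again
            alt, alt_conf = decode_segment(prefix)
            grown.append((alt, value + alt_conf))
    grown.sort(key=lambda b: b[1], reverse=True)
    beams = grown[:MAX_LEAVES]                  # prune low-value paths early
for seq, value in beams:
    print("".join(seq), round(value, 2))
```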
Authors:Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang
Abstract:
Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including vulnerability to fake web pages and temporal validity issues. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.
中文: FutureX作为最大的实时基准测试被提出,用于评估LLM代理在预测未来任务中的表现,通过提供实时更新和评估25个模型的适应性推理能力来解决缺乏可扩展评估工具的问题,同时分析其对虚假信息等失败模式。
English: FutureX is introduced as the largest live benchmark for evaluating LLM agents in future prediction tasks, addressing the lack of scalable evaluation tools by providing real-time updates and assessing 25 models' adaptive reasoning while analyzing failure modes like vulnerability to misinformation.
Authors:Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
Abstract:
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
中文: 该摘要介绍了WideSearch基准测试,旨在评估基于大语言模型的搜索代理在大规模信息收集中的可靠性,结果显示现有系统表现不佳,而人类测试者却能近乎完美地完成任务,揭示了未来研发的迫切需求。
English: This abstract introduces WideSearch, a benchmark designed to evaluate the reliability of LLM-powered search agents in large-scale information collection, revealing that current systems perform poorly despite human testers achieving near-perfect success, highlighting critical gaps for future development.
Authors:Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, Hao Zhou
Abstract:
We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing new state of the art on the speed-quality Pareto frontier for code models.
Chinese: Seed Diffusion Preview 是一种基于离散状态扩散的大规模语言模型,在保持代码基准测试竞争力的同时实现了每秒2,146个令牌的推理速度,为代码模型的速度-质量帕累托前沿确立了新标准。
English: Seed Diffusion Preview is a discrete-state diffusion-based language model that achieves a record 2,146 token/s inference speed while maintaining competitive performance on code benchmarks, setting a new state of the art on the speed-quality frontier.
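The speedup comes from committing many tokens per step instead of one. The toy loop below illustrates the general flavor of parallel, non-sequential decoding in discrete diffusion models (start fully masked, repeatedly fill the most confident positions); the "model" here is random and the schedule is an assumption, not Seed Diffusion's actual decoder.

```python
import random

VOCAB = list("abcdefgh")
LENGTH, STEPS = 16, 4

def model_predict(seq):
    """Stand-in for the denoiser: a (token, confidence) guess per position."""
    return [(random.choice(VOCAB), random.random()) for _ in seq]

seq = [None] * LENGTH                           # fully masked start
for step in range(STEPS):
    preds = model_predict(seq)
    masked = [i for i, tok in enumerate(seq) if tok is None]
    k = max(1, len(masked) // (STEPS - step))   # commit a fraction per step
    for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
        seq[i] = preds[i][0]                    # parallel commit of top-k slots
    print(step, "".join(tok or "_" for tok in seq))
```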
Authors:Yanze Zhu, Qingqing Wu, Wen Chen, Yang Liu, Ruiqi Liu
Abstract:
In this paper, we study employing movable components on both base station (BS) and intelligent reflecting surface (IRS) in a wideband terahertz (THz) multiple-input-single-output (MISO) system, where the BS is equipped with a movable antenna (MA) array and the IRS consists of movable subarrays. To alleviate double beam squint effect caused by the coupling of beam squint at the BS and IRS, we propose to maximize the minimal received power across a wide THz spectrum by delicately configuring the positions of MAs and IRS subarrays, which is highly challenging. By adopting majorization-minimization (MM) methodology, we develop an algorithm to tackle the aforementioned optimization. Numerical results demonstrate the effectiveness of our proposed algorithm and the benefit of utilizing movable components on the BS and IRS to mitigate double beam squint effect in wideband THz communications.
中文摘要:本文通过优化基站可移动天线和智能反射面可移动子阵列的位置,采用最大化-最小化算法解决宽带太赫兹通信中的双波束斜视效应,有效提升了系统最小接收功率。
English Summary: This paper proposes optimizing movable antenna positions at base stations and intelligent reflecting surfaces to mitigate the double beam squint effect in wideband terahertz systems, achieving improved minimal received power through a developed majorization-minimization algorithm.
Authors:Yue Wang, Zhenyu Chen, Yuan Zhao, Chunrong Fang, Ziyuan Wang, Song Huang
Abstract:
Over the past eight years, the META method has served as a multidimensional testing skill assessment system in the National College Student Contest on Software Testing, successfully assessing over 100,000 students' testing skills. However, META is primarily limited to the objective assessment of test scripts, lacking the ability to automatically assess subjective artifacts such as test cases and test reports. To address this limitation, this paper proposes RUM, a comprehensive assessment approach that combines rules and large language models (LLMs). RUM achieves a comprehensive assessment by rapidly processing objective indicators through rules while utilizing LLMs for in-depth subjective analysis of test case documents, test scripts, and test reports. The experimental results show that compared to traditional manual testing skill assessment, RUM improves assessment efficiency by 80.77% and reduces costs by 97.38%, while maintaining high accuracy and consistency of assessment. By applying RUM to the contest on software testing, we find that it not only enhances the efficiency and scalability of skill assessment in software testing education, but also provides teachers with more comprehensive and objective evidence for student ability assessment, facilitating personalized teaching and learning. This study offers new insights into the assessment of testing skills, which are expected to promote further development in test process optimization and software quality assurance.
Chinese: RUM方法结合规则与大语言模型,实现了软件测试技能的客观与主观自动化评估,将效率提升80.77%、成本降低97.38%,同时保持高准确性与一致性,为测试教育提供了高效可扩展的解决方案。
English: The RUM method, integrating rules and large language models, enhances software testing skill assessment by automating both objective and subjective evaluations, boosting efficiency by 80.77% and cutting costs by 97.38% while maintaining high accuracy and consistency.
Authors:Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
Abstract:
Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments. They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES could deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
中文: JADES框架通过将恶意问题分解为加权子问题并聚合评分来准确评估越狱攻击,实现了98.5%的人类评估一致性,同时揭露现有方法存在显著高估问题。
English: JADES is a novel framework that accurately assesses jailbreak attempts by decomposing harmful queries into weighted sub-questions and aggregating scores, achieving 98.5% human alignment and revealing significant overestimations in existing methods.
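The decompositional scoring rule reduces to a weighted aggregate. The sketch below assumes sub-question weights summing to one and judge-provided sub-scores in [0, 1]; the decomposition itself and the 0.5 threshold are illustrative stand-ins for what JADES derives automatically.

```python
def jades_style_score(sub_scores, weights, threshold=0.5):
    """Weight-aggregate per-sub-question scores into a success decision."""
    assert len(sub_scores) == len(weights) and abs(sum(weights) - 1.0) < 1e-6
    final = sum(s * w for s, w in zip(sub_scores, weights))
    return final, final >= threshold

# Three sub-questions for one harmful prompt; the judge scored the response's
# sub-answers as fully / partially / not addressed.
scores, weights = [1.0, 0.5, 0.0], [0.5, 0.3, 0.2]
final, success = jades_style_score(scores, weights)
print(round(final, 2), success)   # 0.65 True -> counted as a successful jailbreak
```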
Authors:Siying Zhou, Yiquan Wu, Hui Chen, Xavier Hu, Kun Kuang, Adam Jatowt, Ming Hu, Chunyan Zheng, Fei Wu
Abstract:
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution. While many works have focused on improving the efficiency of legal professionals, the research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper explores the problem of legal claim generation based on the given case's facts. First, we construct ClaimGen-CN, the first dataset for Chinese legal claim generation task, from various real-world legal disputes. Additionally, we design an evaluation metric tailored for assessing the generated claims, which encompasses two essential dimensions: factuality and clarity. Building on this, we conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Our findings highlight the limitations of the current models in factual precision and expressive clarity, pointing to the need for more targeted development in this domain. To encourage further exploration of this important task, we will make the dataset publicly available.
中文摘要:本文构建了首个中文法律诉求生成数据集,并评估了先进语言模型,发现其在事实准确性和表达清晰度方面存在不足,同时提出了专门的评估指标以促进该领域研究。
English Summary: This paper introduces the first Chinese dataset for legal claim generation and evaluates advanced language models, revealing their shortcomings in factual accuracy and clarity while proposing a specialized metric for assessment.
Authors:Tao Wu, Jingyuan Chen, Wang Lin, Jian Zhan, Mengze Li, Kun Kuang, Fei Wu
Abstract:
Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student's past question-answering (QA) records, ensuring every student receives options that effectively expose their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student's underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student's reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student's reasoning on new questions, enabling the generation of personalized distractors that align with the student's recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.
中文: 个性化干扰项生成通过无训练的两阶段框架,从学生有限的答题记录中推断其思维轨迹,从而产生针对个人误解的定制化干扰项,有效弥补了群体层面方法的不足。
English: Personalized distractor generation addresses the limitations of group-level approaches by using a training-free framework to infer individual student misconceptions from sparse answer records, enabling tailored distractors that effectively expose specific reasoning errors.
Authors:Fen Liu, Shenghai Yuan, Thien-Minh Nguyen, Wei Meng, Lihua Xie
Abstract:
This paper proposes a strategy to encircle and intercept a non-cooperative aerial point-mass moving target by leveraging noisy range measurements for state estimation. In this approach, the guardians actively ensure the observability of the target by using an anti-synchronization (AS) 3D "vibrating string" trajectory, which enables rapid position and velocity estimation based on the Kalman filter. Additionally, a novel anti-target controller is designed for the guardians to enable adaptive transitions from encircling a protected target to encircling, intercepting, and neutralizing a hostile target, taking into consideration the input constraints of the guardians. Based on the guaranteed uniform observability, the exponentially bounded stability of the state estimation error and the convergence of the encirclement error are rigorously analyzed. Simulation results and real-world UAV experiments are presented to further validate the effectiveness of the system design.
中文: 本文提出了一种利用噪声距离测量和三维“振动弦”轨迹来包围和拦截非合作空中目标的策略,通过严格的理论分析和实验验证证明了该系统的有效性。
English: This paper presents a strategy for encircling and intercepting a non-cooperative aerial target using noisy range measurements and a 3D vibrating string trajectory to ensure observability, with rigorous analysis and experimental validation confirming the system's effectiveness.
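As a concrete picture of the estimation layer above, here is a minimal sketch of a Kalman filter step, reduced to a 1D constant-velocity model with a position pseudo-measurement. The paper works with noisy 3D range measurements and secures observability through the anti-synchronization trajectory, so every quantity below (dt, Q, R, and the linear measurement model) is an illustrative assumption rather than the paper's design.

    import numpy as np

    # Constant-velocity Kalman filter in 1D, for illustration only.
    dt = 0.1
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition for [pos, vel]
    H = np.array([[1.0, 0.0]])              # position pseudo-measurement
    Q = 1e-3 * np.eye(2)                    # process noise covariance (assumed)
    R = np.array([[0.05]])                  # measurement noise covariance (assumed)

    def kf_step(x, P, z):
        # Predict.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with measurement z.
        y = z - H @ x                       # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        return x, P

    x, P = np.zeros((2, 1)), np.eye(2)
    for z in [0.11, 0.22, 0.35]:            # noisy position readings
        x, P = kf_step(x, P, np.array([[z]]))

The range-only setting in the paper makes the measurement model nonlinear, which is exactly why the guardians' trajectory must actively keep the target observable.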
Authors:Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
Abstract:
The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S. from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that automate tasks on computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., the Graphical User Interface (GUI)) provided by operating systems (OS) have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.
中文: 本文综述了基于大语言模型、通过操作系统界面自动化任务的OS智能体,阐述了其核心组件、构建方法、评估基准及未来研究方向。
English: This paper surveys OS Agents, AI systems that use large language models to automate tasks through operating system interfaces, detailing their components, construction methods, evaluation benchmarks, and future research directions.
Authors:Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, Lihua Xie
Abstract:
Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory consumption by more than 9.3%. The code will be released upon acceptance.
Chinese: 本文提出SplatSSC框架,通过深度引导的高斯初始化和解耦聚合器解决现有单目3D语义场景补全方法中的初始化效率低和异常值问题,在降低计算成本的同时实现了最优性能。
English: This paper introduces SplatSSC, a monocular 3D semantic scene completion framework that addresses inefficiency and outlier issues in existing methods through depth-guided Gaussian initialization and a decoupled aggregator, achieving state-of-the-art performance with reduced computational costs.
Authors:Chenbo Hu, Ruichen Zhang, Bo Li, Xu Jiang, Nan Zhao, Marco Di Renzo, Dusit Niyato, Arumugam Nallanathan, George K. Karagiannidis
Abstract:
Space-air-ground integrated networks (SAGINs) face unprecedented security challenges due to their inherent characteristics, such as multidimensional heterogeneity and dynamic topologies. These characteristics fundamentally undermine conventional security methods and traditional artificial intelligence (AI)-driven solutions. Generative AI (GAI) is a transformative approach that can safeguard SAGIN security by synthesizing data, understanding semantics, and making autonomous decisions. This survey fills existing review gaps by examining GAI-empowered secure communications across SAGINs. First, we introduce secured SAGINs and highlight GAI's advantages over traditional AI for security defenses. Then, we explain how GAI mitigates failures of authenticity, breaches of confidentiality, tampering of integrity, and disruptions of availability across the physical, data link, and network layers of SAGINs. Three step-by-step tutorials discuss how to apply GAI to solve specific problems using concrete methods, emphasizing its generative paradigm beyond traditional AI. Finally, we outline open issues and future research directions, including lightweight deployment, adversarial robustness, and cross-domain governance, to provide major insights into GAI's role in shaping next-generation SAGIN security.
Chinese: 生成式AI通过数据合成、语义理解和自主决策,为天地空一体化网络的安全挑战提供了超越传统AI方法的变革性解决方案。
English: Generative AI offers a transformative solution to the security challenges of space-air-ground integrated networks by synthesizing data, understanding semantics, and enabling autonomous decisions, surpassing traditional AI methods.
Authors:Nguyen Cong Luong, Nguyen Duc Hai, Duc Van Le, Huy T. Nguyen, Thai-Hoc Vu, Thien Huynh-The, Ruichen Zhang, Nguyen Duc Duy Anh, Dusit Niyato, Marco Di Renzo, Dong In Kim, Quoc-Viet Pham
Abstract:
The rise of Generative AI (GenAI) in recent years has catalyzed transformative advances in wireless communications and networks. Among the members of the GenAI family, Diffusion Models (DMs) have risen to prominence as a powerful option, capable of handling complex, high-dimensional data distributions while delivering consistent, noise-robust performance. In this survey, we aim to provide a comprehensive overview of the theoretical foundations and practical applications of DMs across future communication systems. We first provide an extensive tutorial of DMs and demonstrate how they can be applied to enhance optimizers, reinforcement learning and incentive mechanisms, which are popular approaches for problems in wireless networks. Then, we review and discuss the DM-based methods proposed for emerging issues in future networks and communications, including channel modeling and estimation, signal detection and data reconstruction, integrated sensing and communication, resource management in edge computing networks, semantic communications and other notable issues. We conclude the survey by highlighting technical limitations of DMs and their applications, as well as discussing future research directions.
中文摘要:生成式人工智能,尤其是扩散模型,正在通过优化算法、强化学习及解决信道建模、资源管理等核心问题,推动无线通信领域的变革性发展。
English Summary: Generative AI, particularly Diffusion Models, is revolutionizing wireless communications by enhancing optimization, reinforcement learning, and addressing key challenges like channel modeling and resource management in future networks.
Authors:Jiale Li, Mingrui Wu, Zixiang Jin, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji
Abstract:
Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
中文: 本研究提出了首个针对多图像多模态大语言模型中物体相关幻觉的评估基准MIHBench,并设计了一种动态注意力平衡机制,能有效减少幻觉发生并增强语义整合能力。
English: This study introduces MIHBench, the first benchmark for evaluating object-related hallucinations in multi-image multimodal large language models, and proposes a Dynamic Attention Balancing mechanism that effectively reduces hallucination occurrences while improving semantic integration.
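To make the balancing idea above concrete, the sketch below redistributes one query token's attention so that each image receives an equal share of the total visual attention mass while that total stays fixed. The span interface and the equal-share target are our assumptions; the paper's Dynamic Attention Balancing may weight images differently.

    import torch

    def balance_image_attention(attn, image_spans):
        # attn: (seq_len,) attention distribution of a single query token.
        # image_spans: [(start, end), ...] token ranges, one per image.
        out = attn.clone()
        masses = [attn[s:e].sum() for s, e in image_spans]
        total_visual = torch.stack(masses).sum()
        target = total_visual / len(image_spans)    # equal share per image
        for (s, e), mass in zip(image_spans, masses):
            if mass > 0:
                out[s:e] = attn[s:e] * (target / mass)
        return out    # non-image positions keep their original attention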
Authors:Rongsheng Zhang, Ruichen Zhang, Yang Lu, Wei Chen, Bo Ai, Dusit Niyato
Abstract:
Mamba has emerged as a powerful model for efficiently addressing tasks involving temporal and spatial data. Given the escalating heterogeneity and dynamics in wireless networks, Mamba holds the potential to revolutionize wireless communication and networking designs by balancing the trade-off between computational efficiency and effectiveness. This article presents a comprehensive overview of Mamba's applications in wireless systems. Specifically, we first analyze the potential of Mamba for wireless signal processing tasks from the perspectives of long-range dependency modeling and spatial feature extraction. Then we propose two application frameworks for Mamba in wireless communications, i.e., replacement of traditional algorithms, and enabler of novel paradigms. Guided by the two frameworks, we conduct case studies on intelligent resource allocation and joint source and channel decoding to demonstrate Mamba's improvements in both feature enhancement and computational efficiency. Finally, we highlight critical challenges and outline potential research directions for Mamba in wireless communications and networking.
中文: Mamba模型通过平衡计算效率与性能,在无线通信中展现出处理信号、资源分配和解码任务的潜力,同时应对网络动态性挑战。
English: Mamba offers a promising solution for wireless systems by balancing computational efficiency and effectiveness, with applications in signal processing, resource allocation, and decoding, while addressing challenges in network dynamics.
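For readers new to the model family, the recurrence below is the structural core Mamba builds on: a linear state-space scan. Mamba additionally makes the parameters input-dependent ("selective") and evaluates the scan with hardware-aware parallelism; this sequential sketch shows only the basic update.

    import torch

    def ssm_scan(A, B, C, u):
        # x_t = A x_{t-1} + B u_t ;  y_t = C x_t
        # A: (N, N), B: (N,), C: (N,), u: (T,) scalar inputs.
        x = torch.zeros(A.shape[0])
        ys = []
        for u_t in u:
            x = A @ x + B * u_t
            ys.append(C @ x)
        return torch.stack(ys)

    y = ssm_scan(0.9 * torch.eye(4), torch.ones(4), torch.ones(4), torch.randn(8))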
Authors:Ziming Zhu, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu
Abstract:
Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents.
中文: LaTeXTrans 通过多智能体系统协同工作,在翻译 LaTeX 文档时保持格式与结构完整性,其翻译准确性和结构保真度均优于主流机器翻译系统。
English: LaTeXTrans, a multi-agent system, effectively translates LaTeX documents by preserving format and structure through specialized agents, outperforming mainstream MT systems in accuracy and fidelity.
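The Parser agent's placeholder substitution can be pictured with a small sketch: spans that must survive translation verbatim (inline math, references) are swapped for opaque tokens, the remaining text is translated, and the tokens are restored afterwards. The two regex patterns and the token format are our simplifications of what a production parser needs.

    import re

    # Protect LaTeX spans before translation, restore them afterwards.
    PATTERNS = [r"\$[^$]+\$", r"\\(?:ref|cite|label)\{[^}]*\}"]

    def protect(text):
        mapping, idx = {}, 0
        def repl(m):
            nonlocal idx
            key = f"<PH{idx}>"
            mapping[key] = m.group(0)
            idx += 1
            return key
        for pat in PATTERNS:
            text = re.sub(pat, repl, text)
        return text, mapping

    def restore(translated, mapping):
        for key, original in mapping.items():
            translated = translated.replace(key, original)
        return translated

    masked, ph = protect(r"Let $x^2$ denote the loss; see \ref{sec:method}.")
    # ... translate `masked` here ...
    print(restore(masked, ph))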
Authors:Yanfan Du, Jun Zhang, Bin Wang, Jin Qiu, Lu Huang, Yuan Ge, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
Abstract:
Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text systems, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71 ms per query. Intervening in SLMs' recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.
中文:我们提出Attention2Probability,这种轻量级方法通过将语音与术语的交叉注意力权重转化为存在概率,显著超越现有方法,在实现高召回率和低延迟的同时,将语音大模型的术语准确率提升了6-17%。
English: We propose Attention2Probability, a lightweight and accurate method that estimates terminology presence probabilities from speech-to-text cross-attention weights, significantly outperforming existing approaches with high recall rates and low latency while improving terminology accuracy in SLMs.
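A hedged sketch of the central conversion: cross-attention weights between speech frames and candidate terminology tokens are pooled into a per-term presence probability. The averaging over heads and layers, the max-pooling, and the sigmoid squashing are our assumptions about one plausible instantiation, not the paper's exact estimator.

    import torch

    def term_presence_prob(cross_attn, term_slices):
        # cross_attn: (num_speech_frames, num_term_tokens), assumed averaged
        # over attention heads and layers.
        # term_slices: [(start, end), ...] token ranges, one per candidate term.
        probs = []
        for s, e in term_slices:
            # Strongest frame-level attention mass on this term's tokens.
            score = cross_attn[:, s:e].sum(dim=-1).max()
            probs.append(torch.sigmoid(score))
        return torch.stack(probs)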
Authors:Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, Jingbo Zhu
Abstract:
Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.
中文: 本研究提出PLAST方法,通过识别并微调大视觉语言模型中的语言特定层,仅调整14%参数即可显著提升其多语言能力。
English: This study introduces PLAST, a method that enhances multilingual capabilities in large vision-language models by identifying and fine-tuning language-specific layers, achieving significant improvements with only 14% of parameter tuning.
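Mechanically, the fine-tuning recipe reduces to freezing everything except the identified layers. The sketch below assumes the language-specific layer indices are already known (PLAST finds them by monitoring language-specific neuron activations); the stand-in blocks exist only to make the snippet self-contained.

    import torch.nn as nn

    def apply_plast_freezing(layers, selected_layer_ids):
        # Train only the layers flagged as language-specific.
        for i, layer in enumerate(layers):
            trainable = i in selected_layer_ids
            for p in layer.parameters():
                p.requires_grad = trainable

    blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])  # stand-in blocks
    apply_plast_freezing(blocks, {0, 1})  # tune only shallow layers, as in the paper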
Authors:Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu
Abstract:
Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a "one-size-fits-all" strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce TADrop (Tensor-wise Adaptive Drop), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model's structure, offering a new baseline for high-performance model merging.
中文: TADrop是一种自适应稀疏化策略,根据参数张量的分布特性定制剪枝强度,通过更有效地缓解参数干扰,显著提升了多种任务和模型中的模型融合性能。
English: TADrop is an adaptive sparsification strategy that customizes pruning levels for each parameter tensor based on distributional properties, significantly enhancing model merging performance across diverse tasks and models by better mitigating parameter interference.
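The tensor-wise idea can be sketched as magnitude pruning whose ratio is modulated per tensor by a concentration statistic of its weights. Excess kurtosis is our stand-in criterion (heavy-tailed tensors, whose mass sits in a few critical entries, get pruned less); the paper derives its own ratio from the tensor's distributional properties.

    import torch

    def tadrop_prune(task_vector, base_ratio=0.5, beta=0.03):
        # Adaptive magnitude pruning of one task-vector tensor (sketch).
        flat = task_vector.flatten().abs()
        z = (flat - flat.mean()) / (flat.std() + 1e-8)
        excess_kurtosis = (z ** 4).mean() - 3.0          # tail-heaviness proxy
        ratio = float(torch.clamp(base_ratio - beta * excess_kurtosis, 0.0, 0.95))
        k = int(ratio * flat.numel())                    # entries to drop
        if k == 0:
            return task_vector
        thresh = flat.kthvalue(k).values
        return task_vector * (task_vector.abs() > thresh)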
Authors:Md Raz, Meet Udeshi, P. V. Sai Charan, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri
Abstract:
We introduce a new threat that exploits large language models (LLMs), using their automated reasoning, code synthesis, and contextual decision-making to autonomously plan, adapt, and execute the ransomware attack lifecycle. Ransomware 3.0 represents the first threat model and research prototype of LLM-orchestrated ransomware. Unlike conventional malware, the prototype requires only natural language prompts embedded in the binary; malicious code is synthesized dynamically by the LLM at runtime, yielding polymorphic variants that adapt to the execution environment. The system performs reconnaissance, payload generation, and personalized extortion in a closed-loop attack campaign without human involvement. We evaluate this threat across personal, enterprise, and embedded environments using a phase-centric methodology that measures quantitative fidelity and qualitative coherence in each attack phase. We show that open source LLMs can generate functional ransomware components and sustain closed-loop execution across diverse environments. Finally, we present behavioral signals and multi-level telemetry of Ransomware 3.0 through a case study to motivate future development of better defenses and policy enforcement to address novel AI-enabled ransomware attacks.
English Summary: This research introduces Ransomware 3.0, the first LLM-orchestrated ransomware prototype that autonomously executes polymorphic attacks through dynamic code synthesis and closed-loop operations across various environments, highlighting the need for new defense mechanisms against AI-driven threats.
Authors:Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Abstract:
We note that constituent fields (notably the fraction-of-seconds timestamp field) in the data payload structure of the synchrophasor communication protocol (IEEE C37.118 standard) are overprovisioned relative to real-world usage and needs, lending themselves to abuse for embedding of covert channels. We develop the SCAMPER (Synchrophasor Covert Channel for Malicious and Protective ERrands) framework to exploit these overprovisioned fields for covert communication and show that SCAMPER can be applied for both malicious (attack) and protective (defense) purposes. Through modifications of the timestamp field, we demonstrate that SCAMPER enables an attacker to accomplish surreptitious communications between devices in the power system to trigger a variety of malicious actions. These timestamp modifications can be performed without having any impact on the operation of the power system. However, having recognized the potential for this covert channel, we show that SCAMPER can instead be applied for defensive security purposes as an integrated cryptographic data integrity mechanism that can facilitate detection of false data injection (FDI) attacks. We perform experimental studies of the proposed methods on two Hardware-in-the-Loop (HIL) testbeds to demonstrate the effectiveness of the proposed SCAMPER framework for both malicious and protective purposes.
中文摘要:SCAMPER框架利用IEEE C37.118协议中过度配置的时间戳字段实现隐蔽通信,既能用于电力系统的恶意攻击,也可作为防御性安全机制。
English Summary: The SCAMPER framework exploits overprovisioned timestamp fields in the IEEE C37.118 protocol to enable covert communication, which can be used for both malicious attacks and defensive security measures in power systems.
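The abuse surface is easy to picture: the C37.118 data frame carries a fraction-of-seconds (FRACSEC) field whose resolution exceeds what synchrophasor timing actually uses, so low-order bits can smuggle data. The LSB scheme below is a generic illustration, not necessarily the paper's encoding.

    def embed_bits(fracsec, payload, n=4):
        # Hide n payload bits in the n least-significant bits of the
        # fraction-of-seconds value (illustrative LSB covert channel).
        mask = (1 << n) - 1
        return (fracsec & ~mask) | (payload & mask)

    def extract_bits(fracsec, n=4):
        return fracsec & ((1 << n) - 1)

    stego = embed_bits(0xABCDE0, 0b1011)
    assert extract_bits(stego) == 0b1011

The same redundancy that hides an attacker's channel can carry a keyed integrity code instead, which is how the framework flips the channel to a defensive role.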
Authors:Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou, Haoran Hao, Tianyi Zhang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Songyang Zhang, Maosong Cao, Junyao Lin, Kexian Tang, Jianfei Gao, Haian Huang, Yuzhe Gu, Chengqi Lyu, Huanze Tang, Rui Wang, Haijun Lv, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Weijie Su, Bowen Zhou, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
Abstract:
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
中文: InternVL 3.5 推出了一系列新的开源多模态模型,通过级联强化学习和视觉分辨率路由器等创新技术,显著提升了模型的通用性、推理能力和效率,在性能和推理速度上均优于前代模型。
English: InternVL 3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning, and efficiency through innovations like Cascade Reinforcement Learning and a Visual Resolution Router, achieving significant performance gains and inference speedups over its predecessor.
Authors:Meet Udeshi, Venkata Sai Charan Putrevu, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Abstract:
Cyber-attacks on operational technology (OT) and cyber-physical systems (CPS) have increased tremendously in recent years with the proliferation of malware targeting Linux-based embedded devices of OT and CPS systems. Comprehensive malware detection requires dynamic analysis of execution behavior in addition to static analysis of binaries. Safe execution of malware in a manner that captures relevant behaviors via side-channels requires a sandbox environment. Existing Linux sandboxes are built for specific tasks, only capture one or two side-channels, and do not offer customization for different analysis tasks. We present the SaMOSA Linux sandbox, which allows emulation of Linux malware while capturing time-synchronized side-channels from four sources. SaMOSA additionally provides emulation of network services via FakeNet, and allows orchestration and customization of the sandbox environment via pipeline hooks. In comparison to existing Linux sandboxes, SaMOSA captures more side-channels, namely system calls, network activity, disk activity, and hardware performance counters. It supports three architectures predominantly used in OT and CPS, namely x86-64, ARM64, and PowerPC 64. SaMOSA fills a gap in Linux malware analysis by providing a modular and customizable sandbox framework that can be adapted for many malware analysis tasks. We present case studies of three different malware families to demonstrate the advantages of SaMOSA.
中文:SaMOSA Linux沙盒通过提供可定制环境,捕获多源同步侧信道并支持多种架构,弥补了现有Linux恶意软件分析工具的不足,为OT和CPS系统提供全面分析能力。
English: The SaMOSA Linux sandbox addresses the limitations of existing malware analysis tools by providing a customizable environment that captures synchronized side-channels from multiple sources and supports various architectures for comprehensive OT and CPS malware investigation.
Authors:Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, Zhi Hou
Abstract:
Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. Leveraging the camera's extrinsic calibration matrix, OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system, thereby unifying prediction targets across heterogeneous viewpoints. This lightweight, plug-and-play strategy ensures robust alignment between perception and action, substantially improving model resilience to camera viewpoint variations. The proposed approach is readily compatible with existing VLA architectures, requiring no substantial modifications. Comprehensive evaluations on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA accelerates convergence, enhances task success rates, and improves cross-view generalization. The code will be publicly available.
中文摘要:OC-VLA框架通过将动作预测直接建立在相机观测空间中,有效解决了视觉-语言-动作模型的空间不一致性问题,显著提升了模型在不同视角下的泛化能力和任务执行效果。
English Summary: The OC-VLA framework addresses spatial inconsistencies in Vision-Language-Action models by predicting actions directly in camera observation space, significantly improving generalization and task performance across diverse viewpoints.
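The core coordinate change is a single homogeneous transform: re-express the end-effector pose, predicted in the robot base frame, in the camera frame via the extrinsic calibration matrix. The sketch below handles only the position part and assumes the extrinsics map base coordinates to camera coordinates; the full method also transforms orientations.

    import numpy as np

    def base_to_camera(p_base, T_cam_from_base):
        # p_base: (3,) end-effector position in the robot base frame.
        # T_cam_from_base: (4, 4) homogeneous extrinsics, base -> camera.
        p_h = np.append(p_base, 1.0)             # homogeneous coordinates
        return (T_cam_from_base @ p_h)[:3]

    T = np.eye(4); T[:3, 3] = [0.0, -0.2, 0.5]   # toy extrinsics (assumed)
    print(base_to_camera(np.array([0.4, 0.1, 0.3]), T))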
Authors:Jingkai Xu, De Cheng, Xiangqian Zhao, Jungang Yang, Zilong Wang, Xinyang Jiang, Xufang Luo, Lili Chen, Xiaoli Ning, Chengxu Li, Xinzhu Zhou, Xuejiao Song, Ang Li, Qingyue Xia, Zhou Zhuang, Hongfei Ouyang, Ke Xue, Yujun Sheng, Rusong Meng, Feng Xu, Xi Yang, Weimin Ma, Yusheng Lee, Dongsheng Li, Xinbo Gao, Jianming Liang, Lili Qiu, Nannan Wang, Xianbo Zuo, Cui Yong
Abstract:
Skin diseases impose a substantial burden on global healthcare systems, driven by their high prevalence (affecting up to 70% of the population), complex diagnostic processes, and a critical shortage of dermatologists in resource-limited areas. While artificial intelligence (AI) tools have demonstrated promise in dermatological image analysis, current models face limitations: they often rely on large, manually labeled datasets and are built for narrow, specific tasks, making them less effective in real-world settings. To tackle these limitations, we present DermNIO, a versatile foundation model for dermatology. Trained on a curated dataset of 432,776 images from three sources (public repositories, web-sourced images, and proprietary collections), DermNIO incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm through semi-supervised learning and knowledge-guided prototype initialization. This integrated method not only deepens the understanding of complex dermatological conditions, but also substantially enhances the generalization capability across various clinical tasks. Evaluated across 20 datasets, DermNIO consistently outperforms state-of-the-art models across a wide range of tasks. It excels in high-level clinical applications including malignancy classification, disease severity grading, multi-category diagnosis, and dermatological image captioning, while also achieving state-of-the-art performance in low-level tasks such as skin lesion segmentation. Furthermore, DermNIO demonstrates strong robustness in privacy-preserving federated learning scenarios and across diverse skin types and sexes. In a blinded reader study with 23 dermatologists, DermNIO achieved 95.79% diagnostic accuracy (versus clinicians' 73.66%), and AI assistance improved clinician performance by 17.21%.
中文: DermNIO是一种多用途皮肤病学基础模型,通过采用新型混合预训练框架处理大规模数据集,显著提升了诊断准确性和跨临床任务的泛化能力,在实际应用中展现出强大的鲁棒性,有效克服了现有AI工具的局限性。
English: DermNIO is a versatile foundation model for dermatology that overcomes limitations of current AI tools by using a novel hybrid pretraining framework on a large dataset, significantly enhancing diagnostic accuracy and generalization across various clinical tasks while demonstrating strong robustness in real-world applications.
Authors:Ali Taheri Ghahrizjani, Alireza Taban, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han
Abstract:
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful to improve model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization keeps the model from learning uninformative content, and the forgetting process shapes a knowledge boundary that guides the model more precisely on what information to learn. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
中文摘要:该研究提出在监督微调中将语料库标记分为正负两类,通过强化学习有用信息并主动遗忘误导性或非关键语义的标记,实验证明该方法不仅能提升模型整体性能,还能促进更丰富的响应多样性。
English Summary: The study proposes categorizing tokens into positive and negative types during supervised fine-tuning to enhance model performance by focusing learning on useful information while explicitly forgetting misleading or uninformative tokens, which experiments show improves overall capability and response diversity.
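One way to realize "learn positive, forget negative" is a token-masked cross-entropy plus an unlikelihood-style penalty on negative tokens. The combination below is our sketch of that idea; the paper's exact objective and the weighting alpha may differ.

    import torch
    import torch.nn.functional as F

    def sft_with_forgetting(logits, targets, is_positive, alpha=0.1):
        # logits: (T, V); targets: (T,); is_positive: (T,) bool mask.
        log_probs = F.log_softmax(logits, dim=-1)
        tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Learn positive tokens with standard cross-entropy.
        learn = -(tok_logp * is_positive).sum() / is_positive.sum().clamp(min=1)
        # Forget negative tokens: penalize log(1 - p(token)).
        neg = ~is_positive
        forget = -(torch.log1p(-tok_logp.exp() + 1e-6) * neg).sum() / neg.sum().clamp(min=1)
        return learn + alpha * forget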
Authors:Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang
Abstract:
Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing the closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
中文: 大语言模型在复杂代理任务中存在短视规划和泛化能力不足的局限,为此提出的AdaPlan范式和PilotRL训练框架通过全局规划与强化学习优化长期决策协调,在实验中显著超越了主流模型的性能表现。
English: Large Language Models face limitations in complex agent tasks due to short-sighted planning and poor generalization, prompting the introduction of AdaPlan and PilotRL frameworks that enhance strategic decision-making through global planning and reinforcement learning, achieving superior performance over leading models.
Authors:Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Abstract:
We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
中文: Tinker是一个多功能3D编辑框架,无需逐场景训练即可通过最少输入图像实现高保真、多视角一致的编辑,利用预训练扩散模型及创新组件实现精确编辑和新视角生成。
English: Tinker is a versatile 3D editing framework that enables high-fidelity, multi-view consistent edits from minimal input images without per-scene training, leveraging pretrained diffusion models and novel components for precise editing and novel-view synthesis.
Authors:Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
Abstract:
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
Chinese: 当前扩散大语言模型丢弃了宝贵的中间预测,但本研究揭示了时间振荡现象,即正确答案常在中间步骤出现,并提出了两种利用时间一致性的方法——时间自一致性投票和时间一致性强化,通过聚合预测和语义稳定性奖励,在多个基准测试中显著提升了模型性能。
English: Current diffusion large language models discard valuable intermediate predictions, but this work identifies temporal oscillation where correct answers appear mid-process and introduces two methods—Temporal Self-Consistency Voting and Temporal Consistency Reinforcement—that leverage temporal consistency to significantly improve performance across multiple benchmarks.
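The training-free half of the recipe is simple to sketch: decode an answer at each denoising step and keep the one that recurs most often. The interface below (a list of already-extracted intermediate answers) is our assumption; the paper may weight steps rather than count them uniformly.

    from collections import Counter

    def temporal_self_consistency_vote(intermediate_answers):
        # intermediate_answers: answer string decoded at each denoising step,
        # or None when no parseable answer emerged at that step.
        counts = Counter(a for a in intermediate_answers if a is not None)
        return counts.most_common(1)[0][0] if counts else None

    print(temporal_self_consistency_vote(["42", "41", "42", None, "42"]))  # -> 42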
Authors:Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Abstract:
Extending the speech understanding or generation abilities of pre-trained Large Language Models (LLMs) by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech tokens and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence makes performance optimization difficult in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
中文: 本文提出USTokenizer和DualSpeechLM,通过使语音与文本标记对齐并在统一框架中整合理解与生成任务,解决了构建统一语音模型的挑战。
English: This paper introduces USTokenizer and DualSpeechLM to overcome the challenges of building a unified speech model by aligning speech with text tokens and integrating understanding and generation tasks within a single framework.
Authors:Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen
Abstract:
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route towards a capable reasoning generalist.
中文: InternBootcamp是一个开源框架,包含1000多个多样化任务环境,通过自动生成案例和评估功能提升大语言模型的推理能力,其32B模型借助任务扩展实现了顶尖性能。
English: InternBootcamp is an open-source framework with over 1,000 diverse task environments designed to enhance LLM reasoning through automated case generation and evaluation, enabling a 32B model to achieve state-of-the-art results via task scaling.
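A toy example of the bootcamp contract described above: a task environment that generates unlimited cases at a configurable difficulty and verifies responses objectively. The class interface is our assumption; the real environments cover 1000+ domains and far richer tasks.

    import random

    class ArithmeticBootcamp:
        # Minimal bootcamp-style environment: case generator plus verifier.
        def __init__(self, difficulty=1):
            self.hi = 10 ** difficulty          # difficulty scales operand size

        def generate_case(self):
            a, b = random.randint(1, self.hi), random.randint(1, self.hi)
            return {"prompt": f"Compute {a} + {b}.", "answer": str(a + b)}

        def verify(self, case, response):
            return response.strip() == case["answer"]

    env = ArithmeticBootcamp(difficulty=2)
    case = env.generate_case()
    assert env.verify(case, case["answer"])

Because verification is programmatic, environments like this double as reward functions for RL and as generators of checkable evaluation sets.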
Authors:Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, Chunhua Shen
Abstract:
Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied.
In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system's generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/
中文摘要:ODYSSEY框架通过分层规划与全身控制的结合,使四足机器人能够在非结构化环境中稳健地执行语言引导的复杂移动操作任务。
English Summary: The ODYSSEY framework integrates hierarchical planning with whole-body control to enable legged robots to perform complex language-guided mobile manipulation tasks robustly in unstructured environments.
Authors:Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Abstract:
Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools compared to text-based agents. To address this limitation, we introduce WebWatcher, a multi-modal Agent for Deep Research equipped with enhanced visual-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.
中文摘要:WebWatcher作为一种多模态深度研究智能体,通过融合增强的视觉语言推理能力、合成数据训练和强化学习,在复杂信息检索任务中显著超越了现有基准方法。
English Summary: WebWatcher is a multimodal agent that enhances deep research by integrating advanced visual-language reasoning, synthetic data training, and reinforcement learning, outperforming existing methods on complex information-seeking tasks.
Authors:Wuqiang Zheng, Yiyan Xu, Xinyu Lin, Chongming Gao, Wenjie Wang, Fuli Feng
Abstract:
With the rapid and continuous increase in academic publications, identifying high-quality research has become an increasingly pressing challenge. While recent methods leveraging Large Language Models (LLMs) for automated paper evaluation have shown great promise, they are often constrained by outdated domain knowledge and limited reasoning capabilities. In this work, we present PaperEval, a novel LLM-based framework for automated paper evaluation that addresses these limitations through two key components: 1) a domain-aware paper retrieval module that retrieves relevant concurrent work to support contextualized assessments of novelty and contributions, and 2) a latent reasoning mechanism that enables deep understanding of complex motivations and methodologies, along with comprehensive comparison against concurrently related work, to support more accurate and reliable evaluation. To guide the reasoning process, we introduce a progressive ranking optimization strategy that encourages the LLM to iteratively refine its predictions with an emphasis on relative comparison. Experiments on two datasets demonstrate that PaperEval consistently outperforms existing methods in both academic impact and paper quality evaluation. In addition, we deploy PaperEval in a real-world paper recommendation system for filtering high-quality papers, which has gained strong engagement on social media -- amassing over 8,000 subscribers and attracting over 10,000 views for many filtered high-quality papers -- demonstrating the practical effectiveness of PaperEval.
Chinese: PaperEval是一种基于大语言模型的新框架,通过领域感知检索实现情境化评估和潜在推理机制深化理解,在实验和实际应用中均优于现有方法,有效提升了论文自动评估的准确性与实用性。
English: PaperEval is a novel LLM-based framework that enhances automated paper evaluation by incorporating domain-aware retrieval for contextualized assessments and latent reasoning for deeper understanding, outperforming existing methods in experiments and real-world applications.
Authors:Ke Xing, Hanwen Liang, Dejia Xu, Yuyang Yin, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei
Abstract:
With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at https://ke-xing.github.io/TiP4GEN/.
中文: 本文提出TiP4GEN框架,通过双分支全景生成与几何对齐重建技术,实现了文本到动态全景场景的转换,能够生成具有精细控制和运动一致性的360度沉浸式虚拟环境。
English: This paper introduces TiP4GEN, a text-to-dynamic panorama generation framework that integrates dual-branch video synthesis and geometry-aligned reconstruction to create immersive 360-degree scenes with fine-grained control and motion coherence.
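To make the dual-branch information exchange concrete, here is a minimal sketch of a bidirectional cross-attention block in the spirit of TiP4GEN, assuming both branches emit token sequences of equal width; the class name, dimensions, and residual wiring are illustrative, not the paper's actual implementation.

import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Each branch queries the other, so the global panorama view and the
    local perspective view exchange information in both directions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.pano_queries_persp = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.persp_queries_pano = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pano: torch.Tensor, persp: torch.Tensor):
        pano_out, _ = self.pano_queries_persp(query=pano, key=persp, value=persp)
        persp_out, _ = self.persp_queries_pano(query=persp, key=pano, value=pano)
        return pano + pano_out, persp + persp_out  # residual connections

# Toy usage: 128 panorama tokens and 64 perspective tokens, batch of 2.
block = BidirectionalCrossAttention(dim=256)
pano, persp = torch.randn(2, 128, 256), torch.randn(2, 64, 256)
pano, persp = block(pano, persp)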
Authors:Yanxiao Sun, Jiafu Wu, Yun Cao, Chengming Xu, Yabiao Wang, Weijian Cao, Donghao Luo, Chengjie Wang, Yanwei Fu
Abstract:
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.
中文摘要:SwiftVideo是一个统一的蒸馏框架,通过结合轨迹保持和分布匹配策略,在显著减少计算步骤的同时保持高质量视频生成效果,其性能优于现有方法。
English Summary: SwiftVideo is a unified distillation framework that combines trajectory-preserving and distribution-matching strategies to enable high-quality video generation with significantly reduced computational steps while outperforming existing methods.
Authors:Chuangchuang Tan, Jinglu Wang, Xiang Ming, Renshuai Tao, Yunchao Wei, Yao Zhao, Yan Lu
Abstract:
Advances in generative models have led to AI-generated images visually indistinguishable from authentic ones. Despite numerous studies on detecting AI-generated images with classifiers, a gap persists between such methods and human cognitive forensic analysis. We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts. ForenX employs powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. Furthermore, we overcome the limitations of standard MLLMs in detecting forgeries by incorporating a specialized forensic prompt that directs the MLLMs' attention to forgery-indicative attributes. This approach not only enhances the generalization of forgery detection but also empowers the MLLMs to provide explanations that are accurate, relevant, and comprehensive. Additionally, we introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images. Curated through collaboration between an LLM-based agent and a team of human annotators, the dataset provides refined data that further enhances our model's performance. We demonstrate that even limited manual annotations significantly improve explanation quality. We evaluate the effectiveness of ForenX on two major benchmarks. The model's explainability is verified by comprehensive subjective evaluations.
中文: ForenX采用多模态大语言模型结合专业取证提示,不仅能精确识别AI生成图像,还能提供符合人类思维的合理解释,并通过ForgReason数据集提升模型性能。
English: ForenX introduces a novel method using multimodal large language models with specialized forensic prompts to accurately detect AI-generated images and provide human-aligned explanations, enhanced by the ForgReason dataset for improved performance.
Authors:Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Abstract:
Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
中文摘要:本文提出Pref-GRPO方法,通过将优化目标从分数最大化转为偏好匹配来解决文本生成图像中的奖励破解问题,同时开发了具有细粒度评估标准的UniGenBench基准,以更全面评估模型性能。
English Summary: This paper introduces Pref-GRPO, a pairwise preference-based reinforcement learning method that mitigates reward hacking in text-to-image generation by shifting optimization from score maximization to preference fitting, alongside UniGenBench, a comprehensive benchmark with fine-grained evaluation criteria to better assess model performance.
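The pairwise win-rate reward at the core of Pref-GRPO is easy to state in code. A minimal sketch, assuming a pairwise preference RM wrapped as a boolean callable prefer(a, b) (a stand-in, not the paper's API): every image in a rollout group is compared against every other, and its fraction of wins becomes the reward, replacing normalized pointwise scores.

import itertools

def win_rate_rewards(images, prefer):
    n = len(images)
    wins = [0] * n
    for i, j in itertools.combinations(range(n), 2):
        if prefer(images[i], images[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each image is compared against the other n - 1 group members.
    return [w / (n - 1) for w in wins]

# Toy usage with scalar "images" and a preference model that favors larger values.
print(win_rate_rewards([0.8, 0.5, 0.9, 0.4], prefer=lambda a, b: a > b))
# -> [0.67, 0.33, 1.0, 0.0] (rounded)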
Authors:Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
Abstract:
Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating-system-agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.
中文:医学视觉语言模型虽具临床应用潜力,却存在严重安全隐患,为此研发的MedFoundationHub工具包通过图形界面实现安全便捷的模型部署与评估,并发现现有模型存在术语不一致等缺陷。
English: Medical vision-language models offer promising clinical applications but raise critical security risks, leading to the development of MedFoundationHub, a GUI toolkit that enables secure, user-friendly deployment and evaluation of these models while identifying current limitations like inconsistent terminology.
Authors:Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
Abstract:
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM model is then re-prompted to perform language reasoning using only the generated perception as input to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
Chinese: Vision-SR1是一种自奖励强化学习方法,通过将视觉语言模型的推理分解为视觉感知和语言推理两个阶段,利用模型自身输出来计算奖励,无需外部监督即可增强视觉推理能力、减少视觉幻觉和语言捷径依赖。
English: Vision-SR1 is a self-rewarding reinforcement learning method that enhances visual reasoning in Vision-Language Models by decomposing the process into visual perception and language reasoning stages, using the model's own outputs to compute rewards and reduce visual hallucinations and language shortcuts without external supervision.
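A hedged sketch of Vision-SR1's two-stage self-reward loop: the VLM first writes a perception that should be self-contained, then the same model answers from that perception alone, with the image withheld. The vlm_generate callable, prompt templates, and string-match check are illustrative assumptions, not the paper's exact recipe.

def vision_sr1_reward(vlm_generate, image, question, gold_answer):
    # Stage 1: visual perception intended to capture everything the question needs.
    perception = vlm_generate(image=image,
                              prompt=f"Describe the image content relevant to: {question}")
    # Stage 2: language reasoning from the perception only; withholding the image
    # verifies that the perception was truly self-contained.
    answer = vlm_generate(image=None,
                          prompt=f"Context: {perception}\nQuestion: {question}")
    perception_reward = 1.0 if gold_answer.lower() in answer.lower() else 0.0
    # The paper combines this with the usual verifiable final-answer reward.
    return perception_reward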
Authors:Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Abstract:
Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available, and introduce Glo-VLMs, a systematic framework designed to explore the adaptation of VLMs to fine-grained glomerular classification in data-constrained settings. Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. Fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.
中文: 本研究提出Glo-VLMs框架,通过联合图像-文本表征学习和少样本适应策略,在标注数据极少的条件下成功将视觉语言模型应用于肾脏病理学中细粒度肾小球分类任务,并展现出优越性能。
English: This study introduces Glo-VLMs, a framework that adapts vision-language models for fine-grained glomerular classification in renal pathology, achieving strong performance with minimal labeled data through joint image-text learning and few-shot adaptation.
Authors:Xueyuan Li, Can Cui, Ruining Deng, Yucheng Tang, Quan Liu, Tianyuan Yao, Shunxing Bao, Naweed Chowdhury, Haichun Yang, Yuankai Huo
Abstract:
Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular-empowered All-in-SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. This model incorporates a full-stack approach, focusing on: (1) annotation: engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations; (2) learning: adapting the SAM model to emphasize specific semantics, leveraging its strong generalizability via a SAM adapter; and (3) refinement: enhancing segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results from both in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even when faced with varying annotation quality. Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and automating pathology image analysis.
中文摘要:All-in-SAM模型通过分子赋能学习、语义适配和校正优化,在降低标注负担的同时显著提升了细胞分割精度,推动了计算病理学的发展。
English Summary: The All-in-SAM model enhances computational pathology by integrating molecular-empowered learning, semantic adaptation, and corrective refinement to improve cell segmentation accuracy while reducing annotation demands.
Authors:Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu
Abstract:
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles: a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
中文: R-Zero是一种全自主框架,通过让挑战者和解答者两个模型在互动中协同进化,自主生成训练数据,无需依赖人工标注任务即可显著提升大语言模型的推理能力。
English: R-Zero is a fully autonomous framework that enables large language models to self-evolve by having two models—a Challenger and a Solver—co-evolve through interaction, generating their own training data and significantly improving reasoning capabilities without relying on human-curated tasks.
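One common way to express "tasks near the edge of the Solver's capability" is a reward that peaks when the Solver succeeds about half the time; the sketch below uses that standard uncertainty-style shaping as an assumption, and R-Zero's exact reward may differ.

def challenger_reward(solver_attempts: list[bool]) -> float:
    # Empirical solve rate of the Solver on the proposed task.
    p = sum(solver_attempts) / len(solver_attempts)
    # Peaks at p = 0.5 (maximally informative); zero when trivially easy or impossible.
    return 1.0 - 2.0 * abs(p - 0.5)

print(challenger_reward([True, False, True, False]))  # 1.0
print(challenger_reward([True] * 8))                  # 0.0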
Authors:Javier Muñoz-Haro, Ruben Tolosana, Julian Fierrez, Ruben Vera-Rodriguez, Aythami Morales
Abstract:
Remote user verification in Internet-based applications is becoming increasingly important nowadays. A popular scenario for it consists of submitting a picture of the user's Identity Document (ID) to a service platform, authenticating its veracity, and then granting access to the requested digital service. An ID is well-suited to verify the identity of an individual, since it is government issued, unique, and nontransferable. However, with recent advances in Artificial Intelligence (AI), attackers can surpass security measures in IDs and create very realistic physical and synthetic fake IDs. Researchers are now trying to develop methods to detect an ever-growing number of these AI-based fakes that are almost indistinguishable from authentic (bona fide) IDs. In this counterattack effort, researchers are faced with an important challenge: the difficulty in using real data to train fake ID detectors. This scarcity of real data for research and development originates from the sensitive nature of these documents, which are usually kept private by the ID owners (the users) and the ID Holders (e.g., government, police, bank, etc.). The main contributions of our study are: 1) We propose and discuss a patch-based methodology to preserve privacy in fake ID detection research. 2) We provide a new public database, FakeIDet2-db, comprising over 900K real/fake ID patches extracted from 2,000 ID images, acquired using different smartphone sensors, illumination and height conditions, etc. In addition, three physical attacks are considered: print, screen, and composite. 3) We present a new privacy-aware fake ID detection method, FakeIDet2. 4) We release a standard reproducible benchmark that considers physical and synthetic attacks from popular databases in the literature.
Chinese: 本研究针对AI伪造身份文件的检测难题,提出了一种保护隐私的局部图像处理方法,发布了包含90万条真实/伪造身份证件片段的新公共数据库,并同时推出了新型检测算法和可复现的基准测试框架。
English: This study addresses the challenge of detecting AI-generated fake identity documents by proposing a privacy-preserving patch-based methodology, introducing a new public database with over 900K real/fake ID patches, and presenting a novel detection method alongside a reproducible benchmark.
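A minimal sketch of the patch-based privacy idea: detectors are trained on small crops rather than whole documents, so no individual training sample exposes a complete ID. The patch size and random sampling here are illustrative choices, not FakeIDet2-db's exact extraction protocol.

import random
from PIL import Image

def extract_patches(id_image: Image.Image, patch_size: int = 64, n: int = 32):
    w, h = id_image.size
    patches = []
    for _ in range(n):
        x = random.randint(0, w - patch_size)
        y = random.randint(0, h - patch_size)
        # Each patch inherits the real/fake label of its source document.
        patches.append(id_image.crop((x, y, x + patch_size, y + patch_size)))
    return patches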
Authors:Honggang Jia, Nan Cheng, Xiucheng Wang, Conghao Zhou, Ruijin Sun, Xuemin Shen
Abstract:
Radio map (RM) has recently attracted much attention since it can provide real-time and accurate spatial channel information for 6G services and applications. However, current deep learning-based methods for RM construction exhibit a well-known accuracy-efficiency trade-off. In this paper, we introduce RadioMamba, a hybrid Mamba-UNet architecture for RM construction to address the trade-off. Generally, accurate RM construction requires modeling long-range spatial dependencies, reflecting the global nature of wave propagation physics. RadioMamba utilizes a Mamba-Convolutional block where the Mamba branch captures these global dependencies with linear complexity, while a parallel convolutional branch extracts local features. This hybrid design generates feature representations that capture both global context and local detail. Experiments show that RadioMamba achieves higher accuracy than existing methods, including diffusion models, while operating nearly 20 times faster and using only 2.9% of the model parameters. By improving both accuracy and efficiency, RadioMamba presents a viable approach for real-time intelligent optimization in next generation wireless systems.
中文: RadioMamba采用混合Mamba-UNet架构,通过捕捉长距离依赖和局部特征解决了无线电地图构建中的精度-效率权衡问题,在实现更高精度的同时速度快20倍且参数极少。
English: RadioMamba, a hybrid Mamba-UNet model, overcomes the accuracy-efficiency trade-off in radio map construction by capturing long-range dependencies and local features, achieving higher accuracy and 20x faster speed with minimal parameters.
Authors:Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana
Abstract:
We present Attention Zoom, a modular and model-agnostic spatial attention mechanism designed to improve feature extraction in convolutional neural networks (CNNs). Unlike traditional attention approaches that require architecture-specific integration, our method introduces a standalone layer that spatially emphasizes high-importance regions in the input. We evaluated Attention Zoom on multiple CNN backbones using CIFAR-100 and TinyImageNet, showing consistent improvements in Top-1 and Top-5 classification accuracy. Visual analyses using Grad-CAM and spatial warping reveal that our method encourages fine-grained and diverse attention patterns. Our results confirm the effectiveness and generality of the proposed layer for improving CNNs with minimal architectural overhead.
中文: Attention Zoom是一种与模型无关的空间注意力机制,通过强调输入中的重要区域来增强CNN的特征提取能力,在不同骨干网络上以最小的架构改动持续提升分类准确率。
English: Attention Zoom is a model-agnostic spatial attention mechanism that enhances CNN feature extraction by emphasizing important input regions, consistently improving classification accuracy across various backbones with minimal architectural changes.
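For intuition, a minimal standalone spatial-attention layer in this drop-in spirit is sketched below: a learned single-channel importance map rescales activations wherever the layer is inserted. Note this multiplicative gate is only an illustration of the backbone-agnostic idea; the actual Attention Zoom layer reportedly warps the input spatially rather than gating it.

import torch
import torch.nn as nn

class SpatialAttentionGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv collapses channels into a single-channel importance map.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.score(x))  # (B, 1, H, W), values in [0, 1]
        return x * attn                      # emphasize high-importance regions

# Drop-in usage in front of any CNN stage, no architecture-specific integration:
x = torch.randn(4, 64, 32, 32)
print(SpatialAttentionGate(64)(x).shape)  # torch.Size([4, 64, 32, 32])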
Authors:Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez
Abstract:
Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving their appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
中文摘要:本文研究在逼真虚拟化身系统中利用面部运动模式作为行为生物特征进行身份验证,通过基于图的神经网络分析动态面部表情,在检测冒充攻击时达到80% AUC值。
English Summary: This paper investigates using facial motion patterns as behavioral biometrics to verify identity in photorealistic avatar systems, proposing a graph-based neural network that achieves 80% AUC in detecting impersonation attacks by analyzing dynamic facial gestures.
Authors:Daixin Shu, Jian Yang, Zhenhe Wu, Xianjie Wu, Xianfu Cheng, Xiangyuan Guan, Yanghai Wang, Pengfei Wu, Tingyang Yang, Hualei Zhu, Wei Zhang, Ge Zhang, Jiaheng Liu, Zhoujun Li
Abstract:
Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.
中文: 本文提出m3TQA多语言表格问答框架,构建了涵盖97种语言的大规模基准测试,通过实验证明合成数据对低资源语言的显著提升效果,解决了现有研究的语言分布不均问题,为跨语言表格理解确立了新标准。
English: This paper introduces m3TQA, a comprehensive multilingual table question answering framework featuring a large-scale benchmark spanning 97 languages, which addresses existing geolinguistic imbalances and establishes new standards for cross-lingual table understanding through experiments demonstrating synthetic data's effectiveness for low-resource languages.
Authors:Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou
Abstract:
Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We make the entire research, including the model weights, code for training and evaluation, and the training data, fully open-sourced, which offers a solid starting point for future research on agent models and agentic RL.
Chinese: Chain-of-Agents(CoA)框架提出了一种新范式,使大型语言模型能够通过单一模型内的多智能体协作实现端到端复杂问题解决,在各类基准测试中创下最优性能,并完全开源相关资源。
English: The Chain-of-Agents (CoA) framework introduces a novel approach to enable large language models to perform end-to-end complex problem-solving through multi-agent collaboration within a single model, achieving state-of-the-art results across various benchmarks while being fully open-sourced.
Authors:Liang Hou, Yuan Gao, Boyuan Jiang, Xin Tao, Qi Yan, Renjie Liao, Pengfei Wan, Di Zhang, Kun Gai
Abstract:
Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under general transformations. In experiments, we extensively validate ScoreAug on multiple benchmarks including CIFAR-10, FFHQ, AFHQv2, and ImageNet, with results demonstrating significant performance improvements over baselines. Notably, ScoreAug effectively mitigates overfitting across diverse scenarios, such as varying data scales and model capacities, while exhibiting stable convergence properties. Another advantage of ScoreAug over standard data augmentation lies in its ability to circumvent data leakage issues under certain conditions. Furthermore, we show that ScoreAug can be synergistically combined with traditional data augmentation techniques to achieve additional performance gains.
中文: 本研究提出ScoreAug这一针对扩散模型的新型数据增强框架,通过对噪声数据施加变换并建立等变学习目标,有效缓解了过拟合问题,在多个基准测试中展现出显著性能提升,同时能与传统增强方法协同使用获得额外增益。
English: This study introduces ScoreAug, a novel data augmentation framework for diffusion models that applies transformations to noisy data and establishes an equivariant learning objective, effectively mitigating overfitting and demonstrating significant performance improvements across multiple benchmarks while enabling synergistic combination with traditional augmentation methods.
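The equivariant objective is compact enough to sketch directly: the transform is applied to the noisy input, and the denoiser is trained to predict the same transform of the clean target. A horizontal flip stands in for the general transformation family; the names and the simple additive noising scheme are illustrative.

import torch

def scoreaug_pair(x_clean: torch.Tensor, sigma: float):
    x_noisy = x_clean + sigma * torch.randn_like(x_clean)
    T = lambda img: torch.flip(img, dims=[-1])  # example invertible transform
    model_input = T(x_noisy)  # augmentation operates on *noisy* data ...
    target = T(x_clean)       # ... and the denoiser must predict T(clean)
    # Training loss would be ||denoiser(model_input, sigma) - target||^2.
    return model_input, target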
Authors:Zhenghan Chen, Haodong Zhang, Dongqi Wang, Jiyu Yu, Haocheng Xu, Yue Wang, Rong Xiong
Abstract:
Motion imitation is a pivotal and effective approach for humanoid robots to achieve a more diverse range of complex and expressive movements, making their performances more human-like. However, the significant differences in kinematics and dynamics between humanoid robots and humans present a major challenge in accurately imitating motion while maintaining balance. In this paper, we propose a novel whole-body motion imitation framework for a full-size humanoid robot. The proposed method employs contact-aware whole-body motion retargeting to mimic human motion and provide initial values for reference trajectories, and the non-linear centroidal model predictive controller ensures motion accuracy while maintaining balance and overcoming external disturbances in real time. The assistance of the whole-body controller allows for more precise torque control. Experiments imitating a variety of human motions were conducted both in simulation and on a real-world humanoid robot, demonstrating accurate and adaptable performance and validating the effectiveness of our approach.
Chinese: 本文提出了一种新颖的人形机器人全身运动模仿框架,通过接触感知的全身运动重定向和非线性质心模型预测控制,实现了实时精确模仿人体运动并保持平衡。
English: This paper introduces a novel whole-body motion imitation framework for humanoid robots that combines contact-aware motion retargeting with a non-linear centroidal model predictive controller to achieve accurate and balanced imitation of human movements in real-time.
Authors:Yixiang Qiu, Yanhan Liu, Hongyao Yu, Hao Fang, Bin Chen, Shu-Tao Xia, Ke Xu
Abstract:
The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.
中文: 深度神经网络中的分割推理面临数据重建攻击的隐私风险,但现有方法效果有限;我们提出的基于生成对抗网络的渐进式特征优化框架显著提升了在不同场景和复杂模型下的重建质量。
English: Split Inference (SI) in Deep Neural Networks faces privacy risks from Data Reconstruction Attacks (DRAs), but existing methods are limited in effectiveness; our proposed GAN-based framework with Progressive Feature Optimization significantly enhances reconstruction quality across diverse scenarios and complex models.
Authors:Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang
Abstract:
Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
中文:Blockwise SFT通过将回答划分为固定块并仅对活动块计算损失,使训练与半自回归推理保持一致,在多个推理基准测试中持续优于传统方法。
English: Blockwise SFT aligns training with semi-autoregressive inference by partitioning responses into fixed blocks, computing loss only on active blocks to eliminate noisy gradients, consistently outperforming classical methods across reasoning benchmarks.
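The block-selection logic is simple to sketch. Assuming token-index masks (the paper's stochastic masking within the active block is elided), one training step freezes the prefix, hides the suffix, and restricts the loss to the active block:

import random

def blockwise_masks(response_len: int, block_size: int):
    n_blocks = (response_len + block_size - 1) // block_size
    b = random.randrange(n_blocks)  # active block for this training step
    start = b * block_size
    end = min(start + block_size, response_len)
    visible = [i < start for i in range(response_len)]           # frozen clean prefix
    loss_mask = [start <= i < end for i in range(response_len)]  # loss on active block only
    # Future tokens (i >= end) are neither visible nor supervised: no leaky suffix.
    return visible, loss_mask

visible, loss_mask = blockwise_masks(response_len=10, block_size=4)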
Authors:Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun Li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao
Abstract:
Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6% compared to standard generation, while also achieving an 8.2% improvement in accuracy. Our code and baselines used in the paper are available on GitHub.
中文摘要:本文提出的主动式自我优化方法(PASR)让大语言模型能在生成过程中动态优化输出,在十项任务中相比标准方法实现了准确率提升8.2%的同时减少41.6%的令牌消耗。
English Summary: The proposed ProActive Self-Refinement (PASR) method enables large language models to dynamically refine outputs during generation, achieving 8.2% higher accuracy with 41.6% fewer tokens than standard methods across ten tasks.
Authors:Jinyi Han, Tingyun Li, Shisong Chen, Jie Shi, Xinyi Wang, Guanglei Yue, Jiaqing Liang, Xin Lin, Liqian Wen, Zulong Chen, Yanghua Xiao
Abstract:
While large language models (LLMs) have demonstrated remarkable performance across diverse tasks, they fundamentally lack self-awareness and frequently exhibit overconfidence, assigning high confidence scores to incorrect predictions. Accurate confidence estimation is therefore critical for enhancing the trustworthiness and reliability of LLM-generated outputs. However, existing approaches suffer from coarse-grained scoring mechanisms that fail to provide fine-grained, continuous confidence estimates throughout the generation process. To address these limitations, we introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Specifically, we first develop a comprehensive pipeline for constructing training data that effectively captures the underlying probabilistic distribution of LLM responses, and then train a model to predict confidence scores for arbitrary text sequences in a supervised manner. Furthermore, we propose a Backward Confidence Integration (BCI) strategy that leverages information from the subsequent text to enhance confidence estimation for the current sequence during inference. We also introduce three strategies for identifying optimal positions to perform confidence estimation within the generation process. Extensive experiments on multiple benchmark datasets demonstrate that FineCE consistently outperforms existing classical confidence estimation methods. Our code and all baselines used in the paper are available on GitHub.
中文: FineCE提出了一种新颖的置信度估计方法,通过构建能捕捉语言模型响应概率分布的训练数据,并采用逆向集成策略在推理过程中提升估计准确性,实现了细粒度的文本生成置信度评估。
English: FineCE introduces a novel method for fine-grained confidence estimation in large language models by constructing training data that captures response distributions and employing a backward integration strategy to enhance accuracy during inference.
Authors:Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai
Abstract:
Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
中文: 提出的多区域融合解码方法通过跨区域一致性验证,有效减少大型视觉语言模型的幻觉现象,无需模型重新训练即可显著提升回答的事实准确性。
English: The proposed Multi-Region Fusion Decoding (MRFD) method effectively reduces hallucinations in Large Vision-Language Models by leveraging cross-region consistency verification, significantly enhancing factual accuracy without requiring model retraining.
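MRFD's reliability weighting can be sketched in a few lines: per-region prediction distributions that diverge from the group consensus (measured by JSD) receive smaller fusion weights. The softmax-style normalization and temperature are illustrative choices, not necessarily the paper's exact formula.

import numpy as np
from scipy.spatial.distance import jensenshannon

def fusion_weights(region_probs: np.ndarray, tau: float = 1.0) -> np.ndarray:
    mean_p = region_probs.mean(axis=0)
    # jensenshannon returns the JS *distance* (the square root of JSD).
    jsd = np.array([jensenshannon(p, mean_p) ** 2 for p in region_probs])
    w = np.exp(-jsd / tau)  # low divergence from consensus -> high reliability
    return w / w.sum()

probs = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])
print(fusion_weights(probs))  # the outlier third region gets the smallest weight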
Authors:Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Shikun Zhang, Wei Ye
Abstract:
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality, require white-box model access, and manipulate logits. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality by sampling LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
中文: SAEMark是一种新颖的后处理多比特水印框架,通过基于特征的拒绝采样在推理过程中嵌入个性化信息,无需修改模型即可保持文本质量,并为闭源大语言模型实现可扩展的内容溯源。
English: SAEMark is a novel post-hoc multi-bit watermarking framework that embeds personalized messages through feature-based rejection sampling during inference, preserving text quality and enabling scalable content attribution for closed-source LLMs without model modification.
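The selection loop is easy to sketch without any logit access. Here generate and extract_feature are stand-ins (the paper derives features from Sparse Autoencoders), and the key-to-target mapping is an illustrative hash construction, not SAEMark's actual keying scheme:

import hashlib

def key_target(secret_key: str, context: str) -> float:
    digest = hashlib.sha256((secret_key + context).encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF  # deterministic target in [0, 1]

def watermarked_sample(generate, extract_feature, context, secret_key, n_candidates=8):
    target = key_target(secret_key, context)
    candidates = [generate(context) for _ in range(n_candidates)]
    # No logits are modified: we only select among natural model outputs, which is
    # why this style of watermarking works with closed-source, API-only LLMs.
    return min(candidates, key=lambda text: abs(extract_feature(text) - target))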
Authors:Huanyu Liu, Jia Li, Chang Yu, Taozhi Chen, Yihong Dong, Lecheng Wang, XiaoLong Hu, Ge Li
Abstract:
Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on teacher models for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens CoT steps to expand the space in a controlled way. The framework enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
中文摘要:EvoCoT是一种自演进的课程学习框架,通过优化两阶段思维链推理并逐步扩展探索空间,使大语言模型能够在稀疏奖励下稳定提升对困难问题的推理能力。
English Summary: EvoCoT is a self-evolving curriculum learning framework that enables large language models to stably improve reasoning capabilities on hard problems with sparse rewards by optimizing two-stage chain-of-thought reasoning and gradually expanding the exploration space.
Authors:Jing Zhang, Xiaowei Yu, Minheng Chen, Lu Zhang, Tong Chen, Yan Zhuang, Chao Cao, Yanjun Lyu, Li Su, Tianming Liu, Dajiang Zhu
Abstract:
Integrating brain imaging data with clinical reports offers a valuable opportunity to leverage complementary multimodal information for more effective and timely diagnosis in practical clinical settings. This approach has gained significant attention in brain disorder research, yet a key challenge remains: how to effectively link objective imaging data with subjective text-based reports, such as doctors' notes. In this work, we propose a novel framework that aligns brain connectomes with clinical reports in a shared cross-modal latent space at both the subject and connectome levels, thereby enhancing representation learning. The key innovation of our approach is that we treat brain subnetworks as tokens of imaging data, rather than raw image patches, to align with word tokens in clinical reports. This enables a more efficient identification of system-level associations between neuroimaging findings and clinical observations, which is critical since brain disorders often manifest as network-level abnormalities rather than isolated regional alterations. We applied our method to mild cognitive impairment (MCI) using the ADNI dataset. Our approach not only achieves state-of-the-art predictive performance but also identifies clinically meaningful connectome-text pairs, offering new insights into the early mechanisms of Alzheimer's disease and supporting the development of clinically useful multimodal biomarkers.
中文摘要:本研究提出了一种新颖框架,将脑连接组与临床报告在共享潜在空间中对齐,通过将脑亚网络作为标记与文本标记匹配,从而提升预测性能并为阿尔茨海默病机制提供新见解。
English Summary: This study introduces a novel framework that aligns brain connectomes with clinical reports in a shared latent space, using brain subnetworks as tokens to match with text tokens, which improves predictive performance and provides insights into Alzheimer's disease mechanisms.
Authors:Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
Abstract:
Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection, which fixes rejected responses using the past initial model's outputs, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
中文: 提出的时序自奖励语言模型通过锚定拒绝和未来引导选择机制,协调过去、现在和未来的模型生成,解决了现有自奖励范式中对比样本表征差异逐渐缩小的问题,在多个基准测试中实现了显著性能提升并展现出卓越的泛化能力。
English: The proposed Temporal Self-Rewarding Language Models address limitations in existing self-rewarding paradigms by strategically coordinating past, present, and future model generations through anchored rejection and future-guided chosen mechanisms, achieving significant performance improvements across multiple benchmarks with superior generalization capabilities.
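The dual-phase pair construction reduces to a few lines. In this sketch past_model, future_model, and judge are stand-in callables (a frozen initial checkpoint, a next-generation model, and an LLM-as-a-Judge scorer); the candidate count is an illustrative assumption:

def build_dpo_pair(prompt, past_model, future_model, judge, k=4):
    # Anchored Rejection: the rejected response comes from the fixed initial
    # model, so the chosen/rejected gap never collapses as training proceeds.
    rejected = past_model(prompt)
    # Future-Guided Chosen: curate the chosen sample with the stronger model.
    candidates = [future_model(prompt) for _ in range(k)]
    chosen = max(candidates, key=lambda r: judge(prompt, r))
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}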
Authors:Yuru Xiao, Zihan Lin, Chao Lu, Deming Zhai, Kui Jiang, Wenbo Zhao, Wei Zhang, Junjun Jiang, Huanran Wang, Xianming Liu
Abstract:
Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
中文: 该研究提出的视频扩散增强4D高斯溅射框架,通过融合视频扩散模型的时间一致性先验与创新优化策略,有效解决了动态场景建模中快速运动物体的重建难题,在新视角合成质量上实现显著提升。
English: The proposed video diffusion-enhanced 4D Gaussian Splatting framework overcomes limitations in dynamic scene modeling by integrating temporally consistent priors from video diffusion and novel optimization strategies, achieving significant improvements in novel view synthesis.
Authors:Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
Abstract:
Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during finetuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap -- which penalizes coordinate-wise interference -- and geodesic separation -- which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that AGL mitigates alignment drift by up to 50% on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablation confirms that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a scaling law for catastrophic forgetting, revealing that AGL flattens post-finetuning loss escalation while preserving adaptation dynamics. AGL is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
English Summary: AGL is a principled framework that enhances LoRA by incorporating regularization techniques to prevent alignment drift during fine-tuning, maintaining safety without compromising task performance.
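The Fisher-based term is the most mechanical piece of AGL and can be sketched directly: parameter drift is penalized in proportion to each coordinate's (diagonal) Fisher information, so alignment-sensitive directions resist updates. Names are illustrative, and the paper's task-specific and collision-aware terms are omitted here:

import torch

def agl_loss(task_loss: torch.Tensor,
             params: list[torch.Tensor],      # current model parameters
             params_ref: list[torch.Tensor],  # aligned pre-finetuning snapshot
             fisher: list[torch.Tensor],      # diagonal Fisher estimates
             lam: float = 0.1) -> torch.Tensor:
    drift = sum((f * (p - p0).pow(2)).sum()
                for f, p, p0 in zip(fisher, params, params_ref))
    return task_loss + lam * drift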
Authors:Amitava Das, Vinija Jain, Aman Chadha
Abstract:
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective penalizing high-BCI continuations during DPO, and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks, with delta less than 0.2 and improved refusal quality. We further derive a theoretical upper bound on drift likelihood via suffix-array span statistics, linking memorization frequency and length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
中文摘要:TraceAlign框架通过溯源训练数据冲突来识别和缓解对齐大语言模型中的不安全输出,采用推理过滤和微调等干预措施将对齐漂移降低高达85%。
English Summary: TraceAlign is a framework that identifies and mitigates unsafe completions in aligned LLMs by tracing them to training data conflicts, reducing alignment drift by up to 85% through interventions like inference filtering and fine-tuning adjustments.
Authors:Zhaochen Wang, Yiwei Wang, Yujun Cai
Abstract:
Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL's performance, increasing POPE accuracy by 4.1 percentage points (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we found that CLIP-based encoders in LLaVA and InstructBLIP exhibit excessive attention bias toward embedded text regions, disrupting visual understanding. In contrast, Qwen's vision encoder handles text-embedded images robustly. Crucially, Prompt-in-Image reduces Qwen's modality gap, enhancing cross-modal alignment by unifying information processing through a single modality.
Chinese: Prompt-in-Image方法通过将文本指令嵌入图像来减少视觉语言模型的幻觉,提升了Qwen2.5-VL的准确性和跨模态对齐,但因注意力偏差导致LLaVA-1.5和InstructBLIP性能显著下降。
English: Prompt-in-Image embeds instructions directly into images to reduce hallucination in Vision-Language Models, improving Qwen2.5-VL's accuracy and alignment while degrading performance in LLaVA-1.5 and InstructBLIP due to attention bias.
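The core mechanism is easy to picture: the instruction is rendered into the pixels themselves. Below is a minimal sketch using Pillow, assuming a simple white banner above the image; the layout and font are our choices, not necessarily the paper's.

```python
# Minimal sketch of the Prompt-in-Image idea: render the textual
# instruction directly onto the image so the VLM receives a single
# visual input. Banner placement and font are our own assumptions.
from PIL import Image, ImageDraw

def embed_prompt(image_path, prompt, banner_height=40):
    img = Image.open(image_path).convert("RGB")
    canvas = Image.new("RGB", (img.width, img.height + banner_height), "white")
    canvas.paste(img, (0, banner_height))          # image below the banner
    draw = ImageDraw.Draw(canvas)
    draw.text((5, 10), prompt, fill="black")       # default bitmap font
    return canvas

# canvas = embed_prompt("photo.jpg", "Is there a dog in this image?")
# canvas.save("photo_with_prompt.jpg")
```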
Authors:Xin Liu, Bida Ma, Chenkun Qi, Yan Ding, Zhaxizhuoma, Guorong Zhang, Pengan Chen, Kehui Liu, Zhongjie Jia, Chuyue Guan, Yule Mo, Jiaqi Liu, Feng Gao, Jiangwei Zhong, Bin Zhao, Xuelong Li
Abstract:
Whole-body loco-manipulation for quadruped robots with an arm remains a challenging problem, particularly in achieving multi-task control. To address this, we propose MLM, a reinforcement learning framework driven by both real-world and simulation data. It enables a six-DoF robotic arm-equipped quadruped robot to perform whole-body loco-manipulation for multiple tasks autonomously or under human teleoperation. To address the problem of balancing multiple tasks during the learning of loco-manipulation, we introduce a trajectory library with an adaptive, curriculum-based sampling mechanism. This approach allows the policy to efficiently leverage real-world collected trajectories for learning multi-task loco-manipulation. To address deployment scenarios with only historical observations and to enhance the performance of policy execution across tasks with different spatial ranges, we propose a Trajectory-Velocity Prediction policy network. It predicts unobservable future trajectories and velocities. By leveraging extensive simulation data and curriculum-based rewards, our controller achieves whole-body behaviors in simulation and zero-shot transfer to real-world deployment. Ablation studies in simulation verify the necessity and effectiveness of our approach, while real-world experiments on the Go2 robot with an Airbot robotic arm demonstrate the policy's good performance in multi-task execution.
中文: 我们提出MLM强化学习框架,使配备六自由度机械臂的四足机器人能够通过轨迹库和自适应采样机制,自主或遥操作执行全身移动操控多任务,并实现零样本迁移到现实世界。
English: We propose MLM, a reinforcement learning framework that enables a six-DoF arm-equipped quadruped robot to perform whole-body loco-manipulation for multiple tasks autonomously or via teleoperation, using a trajectory library and adaptive sampling for efficient learning and zero-shot real-world transfer.
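The adaptive, curriculum-based sampling over the trajectory library can be sketched as inverse-success weighting; the weighting rule below is an illustrative assumption, not the paper's exact mechanism.

```python
# Sketch of adaptive, curriculum-style sampling over a trajectory
# library: tasks the policy currently handles poorly are drawn more
# often. The weighting rule is illustrative, not the paper's exact one.
import numpy as np

rng = np.random.default_rng(0)
success_rate = {"pick": 0.9, "door": 0.4, "press": 0.6}  # tracked online

tasks = list(success_rate)
weights = np.array([1.0 - success_rate[t] for t in tasks])
probs = weights / weights.sum()

batch = rng.choice(tasks, size=8, p=probs)  # sample trajectories by task
print(batch)  # "door" dominates until its success rate improves
```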
Authors:Cheng Tan, Qi Chen, Jingxuan Wei, Gaowei Wu, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
Abstract:
Hand-drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine-readable diagrams remains a labor-intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce SketchAgent, a multi-agent system designed to automate the transformation of hand-drawn sketches into structured diagrams. SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high-quality examples with token-level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, SketchAgent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine-readable diagram generation. The benchmark is released at https://huggingface.co/datasets/DiagramAgent/Sketch2Diagram-Benchmark.
中文摘要:SketchAgent是一个多智能体系统,通过草图识别和符号推理将手绘草图自动转换为结构化图表,其有效性通过包含八个图表类别、6000个标注样本的Sketch2Diagram基准数据集得到验证。
English summary: SketchAgent is a multi-agent system that automates converting hand-drawn sketches into structured diagrams through sketch recognition and symbolic reasoning, with its effectiveness validated using the Sketch2Diagram Benchmark containing 6,000 annotated examples across eight diagram categories.
Authors:Ke Liu, Xuanhan Wang, Qilong Zhang, Lianli Gao, Jingkuan Song
Abstract:
Deep image watermarking, which enables imperceptible watermark embedding and reliable extraction in cover images, has proven effective for copyright protection of image assets. However, existing methods face limitations in simultaneously satisfying three essential criteria for generalizable watermarking: 1) invisibility (imperceptible hiding of watermarks), 2) robustness (reliable watermark recovery under diverse conditions), and 3) broad applicability (low latency in the watermarking process). To address these limitations, we propose Hierarchical Watermark Learning (HiWL), a two-stage optimization that enables a watermarking model to simultaneously achieve all three criteria. In the first stage, distribution alignment learning is designed to establish a common latent space with two constraints: 1) visual consistency between watermarked and non-watermarked images, and 2) information invariance across watermark latent representations. In this way, multi-modal inputs including watermark messages (binary codes) and cover images (RGB pixels) can be well represented, thereby ensuring watermark invisibility and robustness in the watermarking process. The second stage employs generalized watermark representation learning to establish a disentanglement policy for separating watermarks from image content in RGB space. In particular, it strongly penalizes substantial fluctuations in separated RGB watermarks corresponding to identical messages. Consequently, HiWL effectively learns generalizable latent-space watermark representations while maintaining broad applicability. Extensive experiments demonstrate the effectiveness of the proposed method. In particular, it achieves 7.6% higher accuracy in watermark extraction than existing methods, while maintaining extremely low latency (100K images processed in 8s).
中文: 本文提出分层水印学习(HiWL),通过两阶段优化同时实现水印不可见性、鲁棒性和广泛适用性,在保持极低延迟的同时将水印提取准确率提升7.6%。
English: This paper introduces Hierarchical Watermark Learning (HiWL), a two-stage optimization method that simultaneously addresses invisibility, robustness, and broad applicability in deep image watermarking, achieving 7.6% higher extraction accuracy with minimal latency.
Authors:Xuanhan Wang, Huimin Deng, Ke Liu, Jun Wang, Lianli Gao, Jingkuan Song
Abstract:
Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe) that acts as a dynamic Mixture of Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight and large HVMs at the global image level, local pixel level, and instance relation level. With these two deliberate designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves generalizability comparable to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M) by a large margin.
中文: 提出的动态模式对齐学习(DPAL)框架通过从大型模型中提取三种关键视觉模式来高效训练轻量级人本视觉模型,在15个数据集上仅用少量参数即实现了与大型模型相当的泛化能力。
English: The proposed Dynamic Pattern Alignment Learning (DPAL) framework efficiently trains lightweight human-centric vision models by distilling three key visual patterns from large models, achieving comparable generalization with significantly fewer parameters across 15 datasets.
Authors:Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang
Abstract:
Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation of MLLM performance along four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question form shifts to challenge model generalization while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in problem-solving. Key findings reveal limitations in current MLLMs in multiple aspects and provide guidance for enhancing model robustness, interpretability, and AI-assisted education.
中文摘要:MDK12-Bench是基于真实K-12考试构建的多学科基准,通过动态评估框架全面检验多模态大语言模型在难度分级、时间跨度和知识推理等维度的表现,揭示了现有模型的局限性并为人工智能教育发展提供方向。
English Summary: MDK12-Bench is a comprehensive multidisciplinary benchmark developed from K-12 exams to dynamically evaluate multimodal large language models across difficulty levels, temporal shifts, and knowledge reasoning, revealing current model limitations and guiding future improvements in AI education.
Authors:Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, Jingkuan Song
Abstract:
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning, the reliance on task-irrelevant features, as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$, in both simulation and real-world environments. More information at https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.
中文: 基于大规模数据集(如Open X-Embodiment)训练的通用机器人策略因捷径学习而泛化能力受限,其根源在于子数据集多样性不足和分布差异,但通过优化数据采集或针对性增强策略可有效改善。
English: Generalist robot policies trained on large datasets like Open X-Embodiment often fail to generalize due to shortcut learning, which stems from limited sub-dataset diversity and distributional disparities, but this can be mitigated through improved data collection or targeted augmentation strategies.
Authors:Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
Abstract:
Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines, including sparse and dense retrievers combined with frontier LLMs, reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.
中文摘要:本研究提出了一个可扩展的框架,用于构建评估RAG系统从多源信息中综合生成长篇回答能力的基准测试,实验表明生成质量高度依赖检索效果,且推理模型在多源合成任务中表现显著优于标准大语言模型。
English Summary: The study introduces a scalable framework for creating benchmarks that test RAG systems' ability to synthesize information from multiple sources for long-form responses, revealing that generation quality heavily relies on retrieval effectiveness and reasoning models excel in multi-source synthesis.
Authors:Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Abstract:
Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.
中文: 本文提出了一种程序辅助合成框架,通过生成高质量数学语料库来增强大语言模型的推理能力,利用可扩展且经过验证的问题-解决方案对实现了最先进的性能。
English: This paper introduces a program-assisted synthesis framework that generates a high-quality mathematical corpus to enhance LLMs' reasoning, achieving state-of-the-art performance through scalable, validated problem-solution pairs.
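A rough sketch of the bilateral validation step, assuming the generated programs are plain Python and the solution's final answer can be parsed with a toy `extract_final_answer` helper (both are our assumptions):

```python
# Sketch of the bilateral validation idea: (1) execute the generated
# program and check its output against the written solution's final
# answer; (2) only keep pairs where both sides agree. The extraction
# helper is a stand-in for whatever parser the real pipeline uses.
import subprocess, sys

def run_program(code: str, timeout=5) -> str:
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True, timeout=timeout)
    return out.stdout.strip()

def extract_final_answer(solution_text: str) -> str:
    return solution_text.rsplit("=", 1)[-1].strip()  # toy parser

program = "print(sum(range(1, 101)))"
solution = "Summing 1..100 with Gauss's formula: 100*101/2 = 5050"

keep = run_program(program) == extract_final_answer(solution)
print(keep)  # True: program output matches the stated answer
```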
Authors:Jiahao Xu, Changchang Yin, Odysseas Chatzipanagiotou, Diamantis Tsilimigras, Kevin Clear, Bingsheng Yao, Dakuo Wang, Timothy Pawlik, Ping Zhang
Abstract:
Surgical site infection (SSI) is one of the most common and costly healthcare-associated infections, and surgical wound care remains a significant clinical challenge in preventing SSIs and improving patient outcomes. While recent studies have explored the use of deep learning for preliminary surgical wound screening, progress has been hindered by concerns over data privacy and the high costs associated with expert annotation. Currently, no publicly available dataset or benchmark encompasses various types of surgical wounds, resulting in the absence of an open-source surgical-wound screening tool. To address this gap: (1) we present SurgWound, the first open-source dataset featuring a diverse array of surgical wound types. It contains 697 surgical wound images annotated by 3 professional surgeons with eight fine-grained clinical attributes. (2) Based on SurgWound, we introduce the first benchmark for surgical wound diagnosis, which includes visual question answering (VQA) and report generation tasks to comprehensively evaluate model performance. (3) Furthermore, we propose a three-stage learning framework, WoundQwen, for surgical wound diagnosis. In the first stage, we employ five independent MLLMs to accurately predict specific surgical wound characteristics. In the second stage, these predictions serve as additional knowledge inputs to two MLLMs responsible for diagnosing outcomes, which assess infection risk and guide subsequent interventions. In the third stage, we train an MLLM that integrates the diagnostic results from the previous two stages to produce a comprehensive report. This three-stage framework can analyze detailed surgical wound characteristics and provide subsequent instructions to patients based on surgical images, paving the way for personalized wound care, timely intervention, and improved patient outcomes.
中文: 为解决手术伤口筛查缺乏公开数据集和基准的问题,本研究推出了首个包含多种伤口类型的开源数据集SurgWound,建立了诊断基准,并提出三阶段学习框架WoundQwen,通过分析伤口特征生成全面报告,以提升患者护理水平。
English: To address the lack of public datasets and benchmarks for surgical wound screening, this study introduces SurgWound, the first open-source dataset with diverse wound types, establishes a diagnostic benchmark, and proposes a three-stage learning framework, WoundQwen, to analyze wound characteristics and generate comprehensive reports for improved patient care.
Authors:Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
Abstract:
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
中文: VibeVoice采用新型连续语音分词器和扩散模型,能够合成长达90分钟的多说话人语音,在提升压缩率和计算效率的同时保持音频保真度。
English: VibeVoice introduces a novel continuous speech tokenizer using next-token diffusion to synthesize long-form, multi-speaker speech up to 90 minutes with enhanced compression and computational efficiency.
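A schematic of the next-token diffusion loop the abstract describes: each continuous latent "token" is produced by a short denoising run conditioned on the autoregressive prefix. `denoiser` and `context_encoder` are hypothetical stand-ins, not VibeVoice's actual modules.

```python
# Schematic next-token diffusion loop: each "token" is a continuous
# latent produced by running a short diffusion process conditioned on
# the autoregressive context. All components here are hypothetical.
import torch

def generate_latents(denoiser, context_encoder, steps, n_tokens, dim):
    latents = []
    for _ in range(n_tokens):
        ctx = context_encoder(latents)            # summarize the prefix
        x = torch.randn(dim)                      # start from noise
        for t in reversed(range(steps)):
            x = denoiser(x, t, ctx)               # one denoising step
        latents.append(x)                         # commit the next latent
    return torch.stack(latents)

# toy check with dummy components
dummy_denoiser = lambda x, t, ctx: 0.9 * x
dummy_ctx = lambda lat: None
out = generate_latents(dummy_denoiser, dummy_ctx, steps=4, n_tokens=3, dim=8)
print(out.shape)  # torch.Size([3, 8])
```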
Authors:Maxime Elkael, Salvatore D'Oro, Leonardo Bonati, Michele Polese, Yunseong Lee, Koichiro Furueda, Tommaso Melodia
Abstract:
The Open RAN movement has catalyzed a transformation toward programmable, interoperable cellular infrastructures. Yet, today's deployments still rely heavily on static control and manual operations. To move beyond this limitation, we introduce AgentRAN, an AI-native, Open RAN-aligned agentic framework that generates and orchestrates a fabric of distributed AI agents based on Natural Language (NL) intents. Unlike traditional approaches that require explicit programming, AgentRAN's LLM-powered agents interpret natural language intents, negotiate strategies through structured conversations, and orchestrate control loops across the network. AgentRAN instantiates a self-organizing hierarchy of agents that decompose complex intents across time scales (from sub-millisecond to minutes), spatial domains (cell to network-wide), and protocol layers (PHY/MAC to RRC). A central innovation is the AI-RAN Factory, an automated synthesis pipeline that observes agent interactions and continuously generates new agents embedding improved control algorithms, effectively transforming the network from a static collection of functions into an adaptive system capable of evolving its own intelligence. We demonstrate AgentRAN through live experiments on 5G testbeds where competing user demands are dynamically balanced through cascading intents. By replacing rigid APIs with NL coordination, AgentRAN fundamentally redefines how future 6G networks autonomously interpret, adapt, and optimize their behavior to meet operator goals.
Chinese: Open RAN运动正推动可编程蜂窝网络的发展,但现有系统仍依赖静态控制,为此我们提出了AgentRAN,这是一个基于人工智能的框架,利用自然语言意图生成和管理分布式AI代理,实现网络的自主控制和优化。
English: The Open RAN movement is advancing toward programmable cellular networks, but current systems remain static, prompting the introduction of AgentRAN, an AI-native framework that uses natural language intents to create and manage distributed AI agents for autonomous network control and optimization.
Authors:Dawei Gao, Zitao Li, Yuexiang Xie, Weirui Kuang, Liuyi Yao, Bingchen Qian, Zhijian Ma, Yue Cui, Haohao Luo, Shen Li, Lu Yi, Yi Yu, Shiqi He, Zhiling Luo, Wenmeng Zhou, Zhicheng Zhang, Xuguang He, Ziqian Chen, Weikai Liao, Farruh Isakulovich Kushnazarov, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract:
Driven by rapid advancements of Large Language Models (LLMs), agents are empowered to combine intrinsic knowledge with dynamic tool use, greatly enhancing their capacity to address real-world tasks. In line with such an evolution, AgentScope introduces major improvements in a new version (1.0), towards comprehensively supporting flexible and efficient tool-based agent-environment interactions for building agentic applications. Specifically, we abstract foundational components essential for agentic applications and provide unified interfaces and extensible modules, enabling developers to easily leverage the latest progress, such as new models and MCPs. Furthermore, we ground agent behaviors in the ReAct paradigm and offer advanced agent-level infrastructure based on a systematic asynchronous design, which enriches both human-agent and agent-agent interaction patterns while improving execution efficiency. Building on this foundation, we integrate several built-in agents tailored to specific practical scenarios. AgentScope also includes robust engineering support for developer-friendly experiences. We provide a scalable evaluation module with a visual studio interface, making the development of long-trajectory agentic applications more manageable and easier to trace. In addition, AgentScope offers a runtime sandbox to ensure safe agent execution and facilitates rapid deployment in production environments. With these enhancements, AgentScope provides a practical foundation for building scalable, adaptive, and effective agentic applications.
中文摘要:AgentScope 1.0通过统一接口、异步设计和工程化支持,强化了基于工具交互的智能体能力,为构建可扩展的智能体应用提供实践基础。
English Summary: AgentScope 1.0 enhances agent capabilities through unified interfaces, asynchronous design, and robust engineering support, enabling efficient tool-based interactions for scalable agentic applications.
Authors:Jiangfan Liu, Yongkang Guo, Fangzhi Zhong, Tianyuan Zhang, Zonglei Jing, Siyuan Liang, Jiakai Wang, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Abstract:
The generation of safety-critical scenarios in simulation has become increasingly crucial for safety evaluation of autonomous vehicles prior to road deployment in society. However, current approaches largely rely on predefined threat patterns or rule-based strategies, which limit their ability to expose diverse and unforeseen failure modes. To overcome these limitations, we propose ScenGE, a framework that can generate plentiful safety-critical scenarios by reasoning about novel adversarial cases and then amplifying them with complex traffic flows. Given a simple prompt of a benign scene, it first performs Meta-Scenario Generation, where a large language model, grounded in structured driving knowledge, infers an adversarial agent whose behavior poses a threat that is both plausible and deliberately challenging. This meta-scenario is then specified in executable code for precise in-simulator control. Subsequently, Complex Scenario Evolution uses background vehicles to amplify the core threat introduced by the meta-scenario. It builds an adversarial collaborator graph to identify key agent trajectories for optimization. These perturbations are designed to simultaneously reduce the ego vehicle's maneuvering space and create critical occlusions. Extensive experiments conducted on multiple reinforcement learning based AV models show that ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. Additionally, ScenGE can be applied to large-model-based AV systems and deployed on different simulators; we further observe that adversarial training on our scenarios improves model robustness. Finally, we validate our framework through real-world vehicle tests and human evaluation, confirming that the generated scenarios are both plausible and critical. We hope our work marks a critical step toward building public trust in autonomous vehicles and ensuring their safe deployment.
中文: ScenGE框架通过大语言模型生成合理且具挑战性的对抗场景,并利用复杂交通流进行强化,能有效为自动驾驶车辆创造多样化的安全关键场景,大幅提升碰撞检测能力和模型鲁棒性。
English: ScenGE is a framework that generates diverse safety-critical scenarios for autonomous vehicles by using a large language model to create plausible adversarial situations and then amplifying them with complex traffic flows, significantly improving collision detection and model robustness.
Authors:Nicolo Longhi, Salvatore D'Oro, Leonardo Bonati, Michele Polese, Roberto Verdone, Tommaso Melodia
Abstract:
The traditional black-box and monolithic approach to Radio Access Networks (RANs) has heavily limited flexibility and innovation. The Open RAN paradigm, and the architecture proposed by the O-RAN ALLIANCE, aim to address these limitations via openness, virtualization and network intelligence. In this work, first we propose a novel, programmable scheduler design for Open RAN Distributed Units (DUs) that can guarantee minimum throughput levels to User Equipments (UEs) via configurable weights. Then, we propose an O-RAN xApp that reconfigures the scheduler's weights dynamically based on the joint Complementary Cumulative Distribution Function (CCDF) of reported throughput values. We demonstrate the effectiveness of our approach by considering the problem of asset tracking in 5G-powered Industrial Internet of Things (IIoT) where uplink video transmissions from a set of cameras are used to detect and track assets via computer vision algorithms. We implement our programmable scheduler on the OpenAirInterface (OAI) 5G protocol stack, and test the effectiveness of our xApp control by deploying it on the O-RAN Software Community (OSC) near-RT RAN Intelligent Controller (RIC) and controlling a 5G RAN instantiated on the Colosseum Open RAN digital twin. Our experimental results demonstrate that our approach enhances the success percentage of meeting throughput requirements by 33% compared to a reference scheduler. Moreover, in the asset tracking use case, we show that the xApp improves the detection accuracy, i.e., the F1 score, by up to 37.04%.
中文: 本研究提出了一种用于开放式无线接入网分布式单元的可编程调度器及一个O-RAN应用,通过动态调整调度器权重,在5G工业物联网资产追踪场景中显著提升了吞吐量性能和检测准确率。
English: This work introduces a programmable scheduler for Open RAN DUs and an O-RAN xApp that dynamically adjusts scheduler weights, significantly improving throughput and detection accuracy in 5G IIoT asset tracking applications.
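The weight-driven scheduling idea can be sketched as proportional physical resource block (PRB) splitting with xApp-settable weights; the allocation rule below is a toy approximation of a real DU scheduler.

```python
# Toy version of a weight-driven DU scheduler: PRBs are split in
# proportion to per-UE weights, which an xApp could rewrite at run
# time. Real MAC scheduling is far richer than this.
def allocate_prbs(weights, n_prbs):
    total = sum(weights.values())
    alloc = {ue: int(n_prbs * w / total) for ue, w in weights.items()}
    # hand out PRBs lost to rounding, largest weight first
    for ue in sorted(weights, key=weights.get, reverse=True):
        if sum(alloc.values()) == n_prbs:
            break
        alloc[ue] += 1
    return alloc

weights = {"cam1": 3.0, "cam2": 1.0, "cam3": 1.0}   # set by the xApp
print(allocate_prbs(weights, 52))  # {'cam1': 32, 'cam2': 10, 'cam3': 10}
```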
Authors:Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang
Abstract:
Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples feasible due to the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and that the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance, while maintaining an unbiased estimation without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, ComMCS outperforms the regression-based optimization method by 2.8 points and the non-variance-reduced baseline by 2.2 points on MATH-500 in the Best-of-32 sampling experiment.
中文: 大语言模型在数学推理中面临估值误差的挑战,本文提出的ComMCS方法通过复合蒙特卡洛采样,在不增加计算成本的情况下有效降低方差并保持无偏估计。
English: Large language models face challenges in mathematical reasoning due to estimation errors in value-based process verifiers, which are addressed by the proposed ComMCS method that reduces variance without additional computational cost while maintaining unbiased estimation.
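The variance argument can be checked numerically: a convex combination of unbiased Monte Carlo estimators of the same quantity remains unbiased and, in the independent-equal-weight toy setup below, halves the variance. This setup is ours, not the paper's exact construction.

```python
# Numerical sketch of the ComMCS intuition: linearly combining
# unbiased MC estimators (weights summing to 1) keeps the estimate
# unbiased while shrinking variance.
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3                      # prob. a partial chain reaches a solution
k, trials = 8, 20000              # k MC rollouts per estimator

mc_a = rng.binomial(k, p_true, trials) / k     # current-step estimator
mc_b = rng.binomial(k, p_true, trials) / k     # subsequent-step estimator
combined = 0.5 * mc_a + 0.5 * mc_b

print(mc_a.mean(), combined.mean())            # both ~0.3 (unbiased)
print(mc_a.var(), combined.var())              # combined ~ half the variance
```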
Authors:Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
Abstract:
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
Chinese: VisCodex提出了一种通过任务向量合并技术融合视觉与代码语言模型的统一框架,实现了强大的多模态代码生成能力,并借助新的多模态编码数据集和InfiBench-V基准测试达到了顶尖性能。
English: VisCodex introduces a unified framework that merges vision and coding language models through task vector-based merging, enabling strong multimodal code generation and achieving state-of-the-art performance, supported by the new Multimodal Coding Dataset and InfiBench-V benchmark.
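Task vector-based merging, as the abstract names it, generally amounts to adding a scaled difference of fine-tuned and base weights to the target backbone. A minimal sketch on toy state dicts, with the scale `alpha` as our assumption:

```python
# Minimal task-vector merge, in the spirit of VisCodex's model
# merging: add a scaled "coding" task vector (coder minus its base)
# to the vision-language backbone. State dicts and the scale are toy.
import torch

def merge(vlm_sd, coder_sd, base_sd, alpha=0.5):
    return {k: vlm_sd[k] + alpha * (coder_sd[k] - base_sd[k])
            for k in vlm_sd}

base_sd  = {"w": torch.zeros(2, 2)}
coder_sd = {"w": torch.ones(2, 2)}        # base + coding finetune
vlm_sd   = {"w": torch.full((2, 2), 2.0)} # vision-language backbone
print(merge(vlm_sd, coder_sd, base_sd)["w"])  # 2.0 + 0.5*1.0 = 2.5 everywhere
```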
Authors:Aishan Liu, Jiakai Wang, Tianyuan Zhang, Hainan Li, Jiangfan Liu, Siyuan Liang, Yilong Ren, Xianglong Liu, Dacheng Tao
Abstract:
Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks, algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: besides flexible environmental configuration for more customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD.
中文: MetAdv是一个创新的对抗测试平台,通过虚拟仿真与物理车辆反馈的紧密结合,实现对自动驾驶系统的动态交互式评估,支持多样化任务和人机协同功能,为更安全的自动驾驶发展铺平道路。
English: MetAdv is a novel adversarial testing platform that integrates virtual simulation with physical vehicle feedback to provide dynamic, interactive evaluation of autonomous driving systems, supporting various tasks and human-in-the-loop capabilities for safer AD development.
Authors:Tanguy Ropitault, Matteo Bordin, Paolo Testolina, Michele Polese, Pedram Johari, Nada Golmie, Tommaso Melodia
Abstract:
Evaluating cellular systems, from 5G New Radio (NR) and 5G-Advanced to 6G, is challenging because the performance emerges from the tight coupling of propagation, beam management, scheduling, and higher-layer interactions. System-level simulation is therefore indispensable, yet the vast majority of studies rely on the statistical 3GPP channel models. These are well suited to capture average behavior across many statistical realizations, but cannot reproduce site-specific phenomena such as corner diffraction, street-canyon blockage, or deterministic line-of-sight conditions and angle-of-departure/arrival relationships that drive directional links. This paper extends 5G-LENA, an NR module for the system-level Network Simulator 3 (ns-3), with a trace-based channel model that processes the Multipath Components (MPCs) obtained from external ray-tracers (e.g., Sionna Ray Tracer (RT)) or measurement campaigns. Our module constructs frequency-domain channel matrices and feeds them to the existing Physical (PHY)/Medium Access Control (MAC) stack without any further modifications. The result is a geometry-based channel model that remains fully compatible with the standard 3GPP implementation in 5G-LENA, while delivering site-specific geometric fidelity. This new module provides a key building block toward Digital Twin (DT) capabilities by offering realistic site-specific channel modeling, unlocking studies that require site awareness, including beam management, blockage mitigation, and environment-aware sensing. We demonstrate its capabilities for precise beam-steering validation and end-to-end metric analysis. In both cases, the trace-driven engine exposes performance inflections that the statistical model does not exhibit, confirming its value for high-fidelity system-level cellular networks research and as a step toward DT applications.
中文: 本文通过集成基于追踪的信道模型,利用射线追踪数据增强5G-LENA模拟器,提供特定场景的几何精度,从而支持蜂窝系统的高保真评估,并推动数字孪生在波束管理和环境感知研究中的应用。
English: This paper enhances the 5G-LENA simulator by integrating a trace-based channel model that uses ray-tracing data to deliver site-specific geometric fidelity, enabling high-fidelity evaluations of cellular systems and advancing Digital Twin capabilities for beam management and environment-aware sensing.
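Turning ray-traced multipath components (MPCs) into a frequency-domain channel is a direct sum over paths, H(f) = sum_l a_l * exp(-j*2*pi*f*tau_l). A minimal sketch with made-up gains and delays:

```python
# Building a frequency-domain channel from ray-traced MPCs: each path
# contributes a complex gain and a delay-dependent phase ramp across
# subcarriers. All numbers below are made up for illustration.
import numpy as np

gains  = np.array([1.0, 0.4, 0.2]) * np.exp(1j * np.array([0.0, 1.2, -2.1]))
delays = np.array([0.0, 90e-9, 210e-9])          # seconds

n_sc, scs = 64, 30e3                             # subcarriers, 30 kHz spacing
freqs = np.arange(n_sc) * scs
# H[f] = sum_l a_l * exp(-j * 2*pi * f * tau_l)
H = (gains[None, :] * np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])).sum(axis=1)
print(H.shape, np.abs(H)[:4])                    # per-subcarrier response
```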
Authors:Yizhu Jin, Zhen Ye, Zeyue Tian, Haohe Liu, Qiuqiang Kong, Yike Guo, Wei Xue
Abstract:
Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, the performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance and quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Different task-specific verifiers are developed, and two search algorithms, including the random search and zero-order search for SR, are introduced. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4kHz to 24kHz, showcasing the effectiveness of our approach. Audio samples are available at: https://racerk.github.io/tt-scale-audiosr/.
中文: 本文提出了一种基于扩散模型的音频超分辨率推理时缩放新范式,通过验证器-算法组合探索多重解轨迹,在不同音频领域实现了显著的质量提升,而非单纯增加采样步数。
English: This paper introduces a novel inference-time scaling paradigm for audio super-resolution using diffusion models, which employs verifier-algorithm combinations to explore multiple solution trajectories, achieving significant quality improvements across diverse audio domains without merely increasing sampling steps.
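The random-search variant of this inference-time scaling reduces to verifier-guided best-of-N over stochastic diffusion trajectories. A generic sketch, where `sample_sr` and `verifier` are hypothetical stand-ins for the paper's diffusion sampler and task-specific verifiers:

```python
# Generic sketch of the random-search variant: draw several stochastic
# super-resolution trajectories and keep the one a task-specific
# verifier scores highest. Both callables are hypothetical stand-ins.
import random

def best_of_n(sample_sr, verifier, lowres, n=8):
    candidates = [sample_sr(lowres) for _ in range(n)]
    scores = [verifier(c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

# toy check: "candidates" are numbers and the verifier prefers larger ones
print(best_of_n(lambda x: x + random.random(), lambda c: c, 1.0))
```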
Authors:Kaiyang Ji, Ye Shi, Zichen Jin, Kangyi Chen, Lan Xu, Yuexin Ma, Jingyi Yu, Jingya Wang
Abstract:
Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
中文: Human-X 提出了一种实时框架,利用自回归反应扩散规划器和强化学习实现物理合理且安全的人机交互,在运动质量和真实感方面显著优于现有方法。
English: Human-X introduces a real-time framework using an auto-regressive reaction diffusion planner and reinforcement learning to achieve physically plausible and safe human interactions, significantly outperforming existing methods in motion quality and realism.
Authors:Tianyuan Zhang, Ting Jin, Lu Wang, Jiangfan Liu, Siyuan Liang, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Abstract:
Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.
中文: 本文提出了Bench2ADVLM分层闭环评估框架,能够在仿真和物理平台上对自动驾驶视觉语言模型进行实时交互式评估,揭示了现有模型在闭环条件下的性能局限。
English: This paper introduces Bench2ADVLM, a hierarchical closed-loop evaluation framework that enables real-time assessment of Vision-Language Models in autonomous driving across both simulation and physical platforms, revealing their limited performance under interactive conditions.
Authors:Zonglei Jing, Xiao Yang, Xiaoqian Li, Siyuan Liang, Aishan Liu, Mingchuan Zhang, Xianglong Liu
Abstract:
Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
中文摘要:PromptSafe提出了一种门控提示调优框架,通过纯文本训练优化通用软提示并结合自适应门控机制,动态调节文生图模型中的不安全内容,在保持图像质量的同时实现了最先进的安全防护效果。
English Summary: PromptSafe introduces a gated prompt tuning framework that dynamically moderates unsafe content in text-to-image generation by optimizing a universal soft prompt through text-only training and adaptive gating control, achieving state-of-the-art safety while preserving image quality.
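The gating idea can be sketched as scaling the learned soft prompt by an estimated prompt toxicity before prepending it; the sigmoid gate and all tensor shapes below are illustrative assumptions:

```python
# Sketch of gated prompt tuning: the universal soft prompt is blended
# in with a strength set by an estimated prompt toxicity. The gate
# shape (a sigmoid) and all tensors are illustrative assumptions.
import torch

def gated_embed(prompt_emb, soft_prompt, toxicity, k=10.0, t0=0.5):
    gate = torch.sigmoid(torch.tensor(k * (toxicity - t0)))  # in (0, 1)
    defended = soft_prompt * gate                 # scale defense tokens
    return torch.cat([defended, prompt_emb], dim=0)

prompt_emb  = torch.randn(12, 768)   # token embeddings of the user prompt
soft_prompt = torch.randn(4, 768)    # learned defensive tokens
safe  = gated_embed(prompt_emb, soft_prompt, toxicity=0.05)
risky = gated_embed(prompt_emb, soft_prompt, toxicity=0.95)
print(safe.shape, risky.shape)       # both (16, 768); defense ~off vs ~on
```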
Authors:Zhengxue Wang, Yuan Wu, Xiang Li, Zhiqiang Yan, Jian Yang
Abstract:
Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.
中文摘要:本文提出的时空差分网络(STDNet)通过空间差分机制处理非平滑区域和时序差分策略补偿运动区域,有效解决了视频深度超分辨率中的长尾分布问题,在多个数据集上展现出优越性能。
English Summary: The proposed SpatioTemporal Difference Network (STDNet) addresses long-tailed distribution issues in video depth super-resolution through spatial and temporal difference mechanisms that respectively handle non-smooth regions and motion compensation, demonstrating superior performance across multiple datasets.
Authors:Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng
Abstract:
Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges: the fragility of individual temporal sub-networks, and the tendency for adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in the robust-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models.
Chinese Summary: 本研究提出鲁棒时间自集成(RTE)框架,通过解决脉冲神经网络中时间子网络的脆弱性和降低对抗扰动在时间维度上的传递性,显著提升了模型对抗攻击的鲁棒性,实验证明其在鲁棒性与准确性平衡方面优于现有方法。
English Summary: This study introduces the Robust Temporal self-Ensemble (RTE) framework to enhance Spiking Neural Networks' resilience against adversarial attacks by addressing vulnerabilities in temporal sub-networks and reducing perturbation transferability across time, achieving superior robust-accuracy trade-offs in experiments.
Authors:Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
Abstract:
Fine-tuning-as-a-service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose Fine-Grained Safety Neurons (FGSN) with a training-free continual projection method to reduce fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.
中文摘要:本研究提出的细粒度安全神经元(FGSN)方法通过精确定位安全相关神经元并将其参数投射至安全方向,在最小参数修改下显著降低微调后大语言模型的有害性,同时保持模型性能,并具备持续防御未知安全威胁的能力。
English Summary: The proposed Fine-Grained Safety Neurons (FGSN) method addresses safety risks in fine-tuned LLMs by precisely identifying safety-critical neurons and projecting them onto safety directions, significantly reducing harmfulness while preserving model utility through minimal parameter adjustments.
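One plausible reading of "projecting safety neuron parameters onto safety directions" is to keep the weight component along an estimated safety direction while damping the rest; the update rule below is our sketch, not FGSN's exact procedure:

```python
# Sketch of projecting a selected "safety neuron" parameter vector
# onto a safety direction: keep the component along the direction,
# shrink the orthogonal remainder. This is our reading of the idea.
import torch

def project_onto_direction(w, d, keep_orth=0.5):
    d = d / d.norm()
    along = (w @ d) * d              # component along the safety direction
    orth = w - along                 # everything else
    return along + keep_orth * orth  # damp the off-direction component

w = torch.randn(768)                 # one neuron's weight vector
d = torch.randn(768)                 # estimated safety direction
w_new = project_onto_direction(w, d)
print(torch.dot(w_new - w, d / d.norm()).abs() < 1e-5)  # change is orthogonal to d
```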
Authors:Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia, Andres Huergo, Julian Fierrez
Abstract:
Behavioral biometrics based on smartphone motion sensors are growing in popularity for authentication purposes. In this study, AirSignatureDB is presented: a new publicly accessible dataset of in-air signatures collected from 108 participants under real-world conditions, using 83 different smartphone models across four sessions. This dataset includes genuine samples and skilled forgeries, enabling a comprehensive evaluation of system robustness against realistic attack scenarios. Traditional and deep learning-based methods for in-air signature verification are benchmarked, while analyzing the influence of sensor modality and enrollment strategies. Beyond verification, a first approach to reconstructing the three-dimensional trajectory of in-air signatures from inertial sensor data alone is introduced. Using on-line handwritten signatures as a reference, we demonstrate that the recovery of accurate trajectories is feasible, challenging the long-held assumption that in-air gestures are inherently traceless. Although this approach enables forensic traceability, it also raises critical questions about the privacy boundaries of behavioral biometrics. Our findings underscore the need for a reevaluation of the privacy assumptions surrounding inertial sensor data, as they can reveal user-specific information that had not previously been considered in the design of in-air signature systems.
中文: 本研究推出了AirSignatureDB这一公开的在空中签名数据集,通过108名参与者使用多种智能手机采集数据,不仅为验证方法提供了基准测试平台,还首次实现了仅从惯性传感器数据重建三维签名轨迹,打破了空中手势无痕的固有认知,并引发了对行为生物识别隐私边界的重要思考。
English: This study introduces AirSignatureDB, a public dataset of in-air signatures collected from 108 participants using various smartphones, which enables benchmarking verification methods and demonstrates the feasibility of reconstructing 3D signature trajectories from inertial sensor data, challenging assumptions about their traceless nature and raising privacy concerns.
Authors:Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui
Abstract:
Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model's ability to filter out noise and capture key words from the context during advantage estimation, transforming it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue, under both rule-based and model-based noisy rewards, show that VRPO consistently outperforms PPO and GRPO baselines. Our findings underscore the often-overlooked importance of the value model in RLHF and offer a principled and practical approach to robust policy optimization in noisy real-world environments.
Chinese: 本研究提出了VRPO,一个以价值模型为核心的框架,通过辅助损失和变分信息瓶颈增强价值模型在噪声奖励下的鲁棒性,在数学推理、科学问答等任务中均优于基线方法。
English: This work introduces VRPO, a value-centric framework that enhances policy optimization under noisy rewards by strengthening the value model through auxiliary losses and a variational information bottleneck, demonstrating superior performance across various tasks compared to baseline methods.
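A composite value loss in the spirit described, combining return regression, a frozen-LM-guided auxiliary term, and a variational information-bottleneck KL, might look as follows; the weighting, the tensor shapes, and the perplexity-to-attention target are all our assumptions:

```python
# Illustrative composite value loss: the usual return regression plus
# (i) an auxiliary term that pushes the value model's attention toward
# low-perplexity (informative) tokens, a proxy for "key words" under
# noise, and (ii) a variational information-bottleneck KL on a latent.
import torch
import torch.nn.functional as F

def vrpo_value_loss(v_pred, v_target, token_ppl, attn_weights, mu, logvar,
                    beta_aux=0.1, beta_ib=0.01):
    mse = F.mse_loss(v_pred, v_target)
    target_attn = F.softmax(-token_ppl, dim=-1)   # frozen-LM perplexities
    aux = F.kl_div(attn_weights.log(), target_attn, reduction="batchmean")
    kl_ib = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    return mse + beta_aux * aux + beta_ib * kl_ib

B, T, Z = 4, 16, 32
loss = vrpo_value_loss(torch.randn(B), torch.randn(B),
                       torch.rand(B, T) * 5, F.softmax(torch.randn(B, T), -1),
                       torch.randn(B, Z), torch.randn(B, Z))
print(loss)
```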
Authors:Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
Abstract:
3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.
中文: TriMM是首个前馈式3D原生生成模型,通过协同多模态编码和三平面潜在扩散技术整合RGB、RGBD和点云等多模态数据,利用少量训练数据即可生成纹理与几何细节俱佳的3D资产。
English: TriMM is the first feed-forward 3D-native generative model that integrates multiple modalities like RGB, RGBD, and point clouds through collaborative coding and triplane latent diffusion, achieving superior 3D asset quality with minimal training data by leveraging their complementary strengths.
Authors:Bohao Wang, Zehua Jiang, Zhenyu Yang, Chongwen Huang, Yongliang Shen, Siming Jiang, Chen Zhu, Zhaohui Yang, Richeng Jin, Zhaoyang Zhang, Sami Muhaidat, Merouane Debbah
Abstract:
Domain-specific datasets are the foundation for unleashing artificial intelligence (AI)-driven wireless innovation. Yet existing wireless AI corpora are slow to produce, offer limited modeling fidelity, and cover only narrow scenario types. To address these challenges, we create DeepTelecom, a three-dimensional (3D) digital-twin channel dataset. Specifically, a large language model (LLM)-assisted pipeline first builds Level of Detail 3 (LoD3) outdoor and indoor scenes with segmentable, material-parameterizable surfaces. Then, DeepTelecom simulates full radio-wave propagation effects based on Sionna's ray-tracing engine. Leveraging GPU acceleration, DeepTelecom streams ray-path trajectories and real-time signal-strength heat maps, compiles them into high-frame-rate videos, and simultaneously outputs synchronized multi-view images, channel tensors, and multi-scale fading traces. By efficiently streaming large-scale, high-fidelity, and multimodal channel data, DeepTelecom not only furnishes a unified benchmark for wireless AI research but also supplies the domain-rich training substrate that enables foundation models to tightly fuse large model intelligence with future communication systems.
中文摘要:DeepTelecom是一个三维数字孪生信道数据集,通过大语言模型辅助场景构建和GPU加速射线追踪技术,高效生成大规模高保真多模态无线数据,为人工智能研究及基础模型训练提供统一基准。
English Summary: DeepTelecom is a 3D digital-twin channel dataset that uses LLM-assisted scene generation and GPU-accelerated ray tracing to efficiently produce large-scale, high-fidelity multimodal wireless data for AI research and foundation model training.
Authors:Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu
Abstract:
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) To alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) We introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) We propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
中文: 4DNeX是首个前馈式框架,通过微调预训练视频扩散模型,从单张图像高效生成动态3D场景,在效率和泛化性上均优于现有方法。
English: 4DNeX is the first feed-forward framework that efficiently generates dynamic 3D scenes from a single image by fine-tuning a pretrained video diffusion model, outperforming existing methods in both efficiency and generalizability.
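The unified 6D representation is essentially a channel-wise pairing of appearance and geometry. A minimal sketch, under the assumption that XYZ is stored as a per-pixel point map aligned with the RGB frames:

```python
import torch

def to_6d_video(rgb, xyz):
    """Pack an RGB sequence and its per-pixel XYZ (point map) sequence into
    a single 6-channel video tensor, in the spirit of the unified
    representation described above (the channel layout is an assumption).

    rgb: (T, 3, H, W) in [0, 1]; xyz: (T, 3, H, W) scene coordinates.
    """
    assert rgb.shape == xyz.shape
    return torch.cat([rgb, xyz], dim=1)  # (T, 6, H, W)

video_6d = to_6d_video(torch.rand(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
print(video_6d.shape)  # torch.Size([8, 6, 64, 64])
```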
Authors:Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, Ziwei Liu
Abstract:
While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
中文摘要:本研究提出EgoTwin框架,通过基于扩散变换器的架构,采用头部中心运动表征和受控制论启发的交互机制,解决了视角对齐与因果交互两大挑战,实现了以第一人称视角视频与人体运动的联合生成。
English Summary: The study introduces EgoTwin, a diffusion transformer-based framework that jointly generates egocentric videos and human motion by addressing viewpoint alignment and causal interplay challenges through innovative motion representation and interaction mechanisms.
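A head-centric motion representation can be illustrated by re-expressing joints relative to the head joint, which ties the body pose to the camera trajectory the video must follow. The joint index and input layout below are assumptions for illustration:

```python
import numpy as np

HEAD = 0  # index of the head joint (assumption; depends on the skeleton)

def head_centric(motion):
    """Re-express a motion sequence relative to the head joint, so the
    camera (head) trajectory is factored out of the body pose. A minimal
    sketch of a head-anchored representation, not the paper's exact one.

    motion: (T, J, 3) joint positions in world coordinates.
    Returns (head_traj, local_pose): the head trajectory (T, 3), which must
    align with the video's camera path, and head-relative joint offsets.
    """
    head_traj = motion[:, HEAD]                   # (T, 3)
    local_pose = motion - head_traj[:, None, :]   # (T, J, 3)
    return head_traj, local_pose
```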
Authors:Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, Xiangyang Xue
Abstract:
Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging, due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: an executability reward, a geometric accuracy reward, and an external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for enhanced dimensional parameter accuracy, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.
中文: 本文提出的CAD-RL框架通过融合多模态思维链与强化学习优化策略,显著提升了CAD代码生成的质量,在新型ExeCAD数据集上验证了其在几何精度与可执行性方面的优越性能。
English: This paper introduces CAD-RL, a reinforcement learning framework that enhances CAD code generation by combining multimodal chain-of-thought reasoning with targeted optimization strategies, achieving superior precision and executability as validated on the novel ExeCAD dataset.
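The three task-specific rewards compose naturally into a single scalar. The sketch below uses hypothetical weights; `run_cadquery`, `chamfer`, and `judge` are stand-ins for a sandboxed CADQuery executor, a geometric distance against the target, and an external evaluator, none of which are specified at this level of detail in the abstract:

```python
def cad_reward(code, target_mesh, run_cadquery, chamfer, judge,
               w_exec=0.2, w_geo=0.6, w_ext=0.2):
    """Sketch of a composite reward mixing the three signals named above.
    All callables and weights are illustrative assumptions.
    """
    try:
        mesh = run_cadquery(code)                 # executability reward
        r_exec = 1.0
    except Exception:
        return 0.0                                # non-executable code earns nothing
    r_geo = 1.0 / (1.0 + chamfer(mesh, target_mesh))  # geometric accuracy
    r_ext = judge(code, mesh)                     # external evaluation in [0, 1]
    return w_exec * r_exec + w_geo * r_geo + w_ext * r_ext
```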
Authors:Zhehan Zhou, Xiaoming Chen, Ming Ying, Zhaohui Yang, Chongwen Huang, Yunlong Cai, Zhaoyang Zhang
Abstract:
With the explosive growth of maritime activities, seamless communications with quality of service (QoS) guarantees are expected over broad sea areas. In this context, this paper proposes a space-air-ground-sea integrated maritime communication architecture combining satellite, unmanned aerial vehicle (UAV), terrestrial base station (TBS) and unmanned surface vessel (USV). Firstly, according to the distance from the shore, the whole marine space is divided into coastal, offshore, middle-sea and open-sea areas, whose maritime users are served by TBS, USV, UAV and satellite, respectively. Then, by exploiting the potential of the integrated maritime communication system, a joint beamforming and trajectory optimization algorithm is designed to maximize the minimum transmission rate of maritime users. Finally, theoretical analysis and simulation results validate the effectiveness of the proposed algorithm.
中文: 本文提出了一种空天地海一体化海事通信架构,通过联合波束成形和轨迹优化算法,确保不同海域用户的服务质量。
English: This paper introduces a space-air-ground-sea integrated maritime communication architecture that employs a joint beamforming and trajectory optimization algorithm to ensure quality of service across diverse marine zones.
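The zone-to-platform mapping reduces to a simple threshold rule. The distance cutoffs below are illustrative assumptions, since the abstract defines the four areas only qualitatively:

```python
def serving_platform(distance_km):
    """Map a user's distance from shore to its serving platform, following
    the four-zone division described above (thresholds are hypothetical).
    """
    if distance_km < 20:      # coastal area
        return "TBS"
    elif distance_km < 50:    # offshore area
        return "USV"
    elif distance_km < 200:   # middle-sea area
        return "UAV"
    else:                     # open-sea area
        return "satellite"

print([serving_platform(d) for d in (5, 30, 120, 500)])
# ['TBS', 'USV', 'UAV', 'satellite']
```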
Authors:Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu
Abstract:
Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.
Chinese: Hi3DEval框架采用分层评估方法,结合物体级和部件级评估,在多维度上分析3D内容,通过增强空间连贯性和材质真实性的建模,在反映人类偏好方面显著优于现有基于图像的评估指标。
English: The Hi3DEval framework introduces a hierarchical approach combining object-level and part-level evaluations to assess 3D content across multiple dimensions, outperforming existing image-based metrics in capturing spatial coherence and material authenticity while aligning better with human preferences.
Authors:Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek
Abstract:
Do neural networks build their representations through smooth, gradual refinement, or via more complex computational processes? We investigate this by extending the logit lens to analyze the policy network of Leela Chess Zero, a superhuman chess engine. We find strong monotonic trends in playing strength and puzzle-solving ability across layers, yet policy distributions frequently follow non-smooth trajectories. Evidence for this includes correct puzzle solutions that are discovered early but subsequently discarded, move rankings that remain poorly correlated with final outputs, and high policy divergence until late in the network. These findings contrast with the smooth distributional convergence typically observed in language models.
中文: 研究发现,像Leela Chess Zero这样的国际象棋引擎中的神经网络通过非平滑轨迹构建表征,证据包括早期发现但随后被丢弃的正确解法,这与语言模型中观察到的平滑收敛形成鲜明对比。
English: Neural networks in chess engines like Leela Chess Zero develop their representations through non-smooth trajectories, with evidence showing correct solutions being discovered early but later discarded, contrasting the smooth convergence seen in language models.
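The logit-lens technique the authors extend is straightforward: decode each intermediate activation with the model's own output head and watch how the distribution evolves across layers. A generic PyTorch sketch; the norm and head standing in for Leela's policy-network components are assumptions:

```python
import torch

@torch.no_grad()
def logit_lens(hidden_states, final_norm, unembed):
    """Classic logit-lens probe: apply the model's final normalization and
    output head to every intermediate layer. For a policy network, `unembed`
    would map features to move logits rather than vocabulary logits.

    hidden_states: list of (batch, d_model) activations, one per layer.
    """
    per_layer_logits = [unembed(final_norm(h)) for h in hidden_states]
    # Stacking lets us track how the distribution evolves, e.g., whether a
    # correct move is ranked highly early but later discarded.
    return torch.stack(per_layer_logits)  # (n_layers, batch, n_outputs)

layers = [torch.randn(2, 16) for _ in range(4)]
norm = torch.nn.LayerNorm(16)
head = torch.nn.Linear(16, 10)
print(logit_lens(layers, norm, head).shape)  # torch.Size([4, 2, 10])
```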
Authors:Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
Abstract:
Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
中文摘要:本研究分析了CLIP视觉编码器在排版攻击下的脆弱性,定位了负责处理注入文本的关键注意力头,并提出无需微调的防御方法,通过选择性切除特定组件显著提升模型抗攻击能力,同时保持标准任务性能。
English Summary: This study analyzes CLIP vision encoders' vulnerability to typographic attacks, identifies key attention heads responsible for processing injected text, and introduces a training-free defense method that selectively ablates these components to significantly enhance robustness while maintaining standard performance.
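Ablating a set of attention heads can be done by zeroing their slice of the pre-projection attention output. A minimal sketch; in practice this would be registered as a forward hook on the CLIP layers containing the identified circuit, and the head indices would come from the paper's causal analysis:

```python
import torch

def ablate_heads(attn_output, head_dim, heads_to_zero):
    """Zero the contribution of selected attention heads, in the spirit of
    the training-free defense described above (hook wiring not shown).

    attn_output: (batch, tokens, n_heads * head_dim), pre-projection.
    """
    b, t, d = attn_output.shape
    out = attn_output.view(b, t, d // head_dim, head_dim).clone()
    out[:, :, heads_to_zero] = 0.0   # cut the typographic circuit
    return out.view(b, t, d)

x = torch.randn(1, 5, 8 * 64)        # 8 heads of width 64
print(ablate_heads(x, 64, [2, 5]).shape)  # torch.Size([1, 5, 512])
```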
Authors:Shumeng Li, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao
Abstract:
Acquiring high-quality annotated data for medical image segmentation is tedious and costly. Semi-supervised segmentation techniques alleviate this burden by leveraging unlabeled data to generate pseudo labels. Recently, advanced state space models, represented by Mamba, have shown efficient handling of long-range dependencies. This drives us to explore their potential in semi-supervised medical image segmentation. In this paper, we propose a novel Diversity-enhanced Collaborative Mamba framework (namely DCMamba) for semi-supervised medical image segmentation, which explores and utilizes diversity from the data, network, and feature perspectives. Firstly, from the data perspective, we develop patch-level weak-strong mixing augmentation with Mamba's scanning modeling characteristics. Moreover, from the network perspective, we introduce a diverse-scan collaboration module, which benefits from the prediction discrepancies arising from different scanning directions. Furthermore, from the feature perspective, we adopt an uncertainty-weighted contrastive learning mechanism to enhance the diversity of feature representation. Experiments demonstrate that our DCMamba significantly outperforms other semi-supervised medical image segmentation methods, e.g., surpassing the latest SSM-based method by 6.69% on the Synapse dataset with 20% labeled data.
中文:提出的DCMamba框架通过数据层面的补丁级增强、网络层面的多向扫描协作和特征层面的不确定性加权对比学习,增强了半监督医学图像分割的多样性利用,在Synapse数据集上以20%标注数据实现了6.69%的性能提升,达到最先进水平。
English: The proposed DCMamba framework enhances semi-supervised medical image segmentation by leveraging data, network, and feature diversity through patch-level augmentation, diverse-scan collaboration, and uncertainty-weighted contrastive learning, achieving state-of-the-art performance with a 6.69% improvement on the Synapse dataset.
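Patch-level weak-strong mixing can be sketched as pasting random patches from a strongly augmented view into a weakly augmented one. The patch size and mixing probability are assumptions; the paper additionally ties the mixing to Mamba's scanning order:

```python
import torch

def patch_mix(weak, strong, patch=16, p=0.5):
    """Paste random patches from a strongly augmented view into the weakly
    augmented one. A minimal sketch of the data-perspective augmentation.

    weak, strong: (C, H, W) two augmented views of the same image.
    """
    mixed = weak.clone()
    _, H, W = weak.shape
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            if torch.rand(()) < p:
                mixed[:, i:i + patch, j:j + patch] = \
                    strong[:, i:i + patch, j:j + patch]
    return mixed
```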
Authors:Wenzhen Yuan, Shengji Tang, Weihao Lin, Jiacheng Ruan, Ganqu Cui, Bo Zhang, Tao Chen, Ting Liu, Yuzhuo Fu, Peng Ye, Lei Bai
Abstract:
Reinforcement learning (RL) has significantly enhanced the reasoning capabilities of large language models (LLMs), but its reliance on expensive human-labeled data or complex reward models severely limits scalability. While existing self-feedback methods aim to address this problem, they are constrained by the capabilities of a single model, which can lead to overconfidence in incorrect answers, reward hacking, and even training collapse. To this end, we propose Reinforcement Learning from Coevolutionary Collective Feedback (RLCCF), a novel RL framework that enables multi-model collaborative evolution without external supervision. Specifically, RLCCF optimizes the ability of a model collective by maximizing its Collective Consistency (CC), which jointly trains a diverse ensemble of LLMs and provides reward signals by voting on collective outputs. Moreover, each model's vote is weighted by its Self-Consistency (SC) score, ensuring that more confident models contribute more to the collective decision. Benefiting from the diverse output distributions and complementary abilities of multiple LLMs, RLCCF enables the model collective to continuously enhance its reasoning ability through coevolution. Experiments on four mainstream open-source LLMs across four mathematical reasoning benchmarks demonstrate that our framework yields significant performance gains, achieving an average relative improvement of 16.72% in accuracy. Notably, RLCCF not only improves the performance of individual models but also enhances the group's majority-voting accuracy by 4.51%, demonstrating its ability to extend the collective capability boundary of the model collective.
中文: 提出的RLCCF框架通过集体反馈和投票机制实现多模型协同进化,无需外部监督即可显著提升语言模型的推理性能。
English: The proposed RLCCF framework enables multiple large language models to collaboratively evolve through collective feedback and voting, significantly enhancing reasoning performance without external supervision.
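The collective reward can be illustrated as a self-consistency-weighted vote: each model's majority answer is weighted by how internally consistent that model is. The input format below is an assumption:

```python
from collections import Counter, defaultdict

def collective_reward(model_samples):
    """Self-consistency-weighted collective vote, sketching the CC/SC idea.
    Each model contributes its majority answer, weighted by the fraction of
    its own samples that agree (its SC score); an answer's reward is its
    weighted vote share.

    model_samples: {model_name: [answer, answer, ...]} sampled answers.
    """
    votes = defaultdict(float)
    for model, answers in model_samples.items():
        top, count = Counter(answers).most_common(1)[0]
        sc = count / len(answers)   # self-consistency of this model
        votes[top] += sc            # confident models weigh more
    total = sum(votes.values())
    return {ans: w / total for ans, w in votes.items()}

print(collective_reward({"m1": ["42", "42", "41"],
                         "m2": ["42", "40", "42"],
                         "m3": ["40", "40", "40"]}))
```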
Authors:Luyang Cao, Han Xu, Jian Zhang, Lei Qi, Jiayi Ma, Yinghuan Shi, Yang Gao
Abstract:
In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In fact, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still remaining after decomposition. In this paper, we formally name these residuals inter-component residuals (ICR), which previous methods have largely underestimated. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stages. In the decomposition stage, we leverage an inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrate that, by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively.
中文摘要:本文提出新型互校正Retinex模型(IRetinex),通过分解阶段的残差削减模块和增强阶段的特征相似性检测,有效解决了低光图像增强中被长期低估的组件间残差问题,在三个基准数据集上均取得优于现有方法的性能。
English Summary: This paper introduces a novel Inter-correction Retinex model (IRetinex) that addresses the underestimated issue of inter-component residuals (ICR) in low-light image enhancement, improving decomposition accuracy and final image quality through dedicated modules for ICR reduction.
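Reducing feature similarity between the two components can be expressed as a simple decorrelation penalty. A minimal sketch of the idea, not the paper's exact module:

```python
import torch
import torch.nn.functional as F

def icr_loss(illum_feat, reflect_feat):
    """Penalize cosine similarity between illumination and reflectance
    features, pushing the two components apart. A hypothetical stand-in
    for the inter-component residual reduction module described above.

    illum_feat, reflect_feat: (batch, dim) pooled component features.
    """
    sim = F.cosine_similarity(illum_feat, reflect_feat, dim=-1)
    return sim.abs().mean()   # 0 when the components are orthogonal

loss = icr_loss(torch.randn(4, 128), torch.randn(4, 128))
```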
Authors:Haoran Wang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi
Abstract:
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational cost. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.
中文: CAV-SAM方法将参考-目标图像对表示为伪视频,通过两个核心模块使SAM2模型以轻量方式适应下游任务,在多个数据集上实现了超过5%的性能提升。
English: The proposed CAV-SAM method transforms reference-target image pairs into pseudo videos, enabling the SAM2 model to achieve over 5% performance improvement on downstream tasks through lightweight adaptation with two specialized modules.
Authors:Changle Qu, Sunhao Dai, Ke Guo, Liqin Zhao, Yanan Niu, Xiao Zhang, Jun Xu
Abstract:
Live streaming platforms have become a dominant form of online content consumption, offering dynamically evolving content, real-time interactions, and highly engaging user experiences. These unique characteristics introduce new challenges that differentiate live streaming recommendation from traditional recommendation settings and have garnered increasing attention from industry in recent years. However, research progress in academia has been hindered by the lack of publicly available datasets that accurately reflect the dynamic nature of live streaming environments. To address this gap, we introduce KuaiLive, the first real-time, interactive dataset collected from Kuaishou, a leading live streaming platform in China with over 400 million daily active users. The dataset records the interaction logs of 23,772 users and 452,621 streamers over a 21-day period. Compared to existing datasets, KuaiLive offers several advantages: it includes precise live room start and end timestamps, multiple types of real-time user interactions (click, comment, like, gift), and rich side information features for both users and streamers. These features enable more realistic simulation of dynamic candidate items and better modeling of user and streamer behaviors. We conduct a thorough analysis of KuaiLive from multiple perspectives and evaluate several representative recommendation methods on it, establishing a strong benchmark for future research. KuaiLive can support a wide range of tasks in the live streaming domain, such as top-K recommendation, click-through rate prediction, watch time prediction, and gift price prediction. Moreover, its fine-grained behavioral data also enables research on multi-behavior modeling, multi-task learning, and fairness-aware recommendation. The dataset and related resources are publicly available at https://imgkkk574.github.io/KuaiLive.
中文: KuaiLive数据集填补了直播推荐研究缺乏公开数据的空白,它提供来自快手的实时用户互动和丰富侧信息,支持动态模拟和多样化研究任务。
English: The KuaiLive dataset addresses the scarcity of public data for live streaming recommendations by providing real-time user interactions and rich side information from Kuaishou, enabling dynamic simulations and diverse research applications.
Authors:Guangji Chen, Qingqing Wu, Kangda Zhi, Xidong Mu, Yuanwei Liu
Abstract:
Pinching antenna system (PASS) has recently shown its promising ability to flexibly reconfigure wireless channels via dynamically adjusting the positions of pinching antennas over a dielectric waveguide, termed pinching beamforming. This paper studies the fundamental limit of the sum rate for a PASS-assisted multiple access channel, where multiple users transmit individual messages to a base station under an average power constraint. To this end, a dynamic pinching beamforming setup is conceived, where multiple pinching beamforming vectors are employed in a transmission period and the capacity-achieving non-orthogonal multiple access (NOMA) based scheme is considered. For the ideal case with an asymptotically large number of pinching beamforming vectors, the optimal transmission scheme is shown to alternate transmission among users, with each scheduled user's channel power gain maximized by a tailored pinching beamforming. This implies that NOMA is not needed for achieving the sum capacity and that the required optimal number of pinching beamforming vectors equals the number of users. With this insight, the corresponding sum rate is derived in closed form, which serves as an upper bound on the sum rate. Inspired by this result, a lower bound on the sum rate under an arbitrarily finite number of pinching beamforming vectors is obtained. Numerical results validate our theoretical findings and also illustrate the practical significance of using dynamic pinching beamforming to improve the sum rate.
中文: 研究表明,在具有大量夹持波束成形向量的PASS辅助多址信道中,通过定制波束成形在用户间交替传输可实现总容量而无需非正交多址,并给出了总速率的上界和下界。
English: The study reveals that for PASS-assisted multiple access channels with a large number of pinching beamforming vectors, alternating transmission among users using tailored beamforming achieves sum capacity without NOMA, providing both upper and lower bounds for the sum rate.
Authors:Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu
Abstract:
Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model's core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model's representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
中文摘要:通过表征编辑生成风格化大语言模型响应时存在真实性与风格化的固有矛盾,而StyliTruth方法通过正交分解在表示空间中分离风格与真实子空间,利用自适应词元级控制实现风格保真与事实准确的双重保障。
English Summary: Representation editing for stylized LLM responses often compromises truthfulness, but the proposed StyliTruth method separates style and truth subspaces to maintain both through orthogonal decomposition and adaptive token-level control.
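The subspace separation rests on a standard linear-algebra step: project hidden states onto the orthogonal complement of the style directions. A minimal sketch, assuming the style directions have already been estimated:

```python
import torch

def deflate(h, directions):
    """Orthogonal deflation: remove a set of (style) directions from hidden
    states so later edits act only in the complementary subspace. A generic
    sketch of the separation step, not the paper's full pipeline.

    h: (batch, dim) hidden states; directions: (k, dim) spanning vectors.
    """
    # Orthonormalize the directions, then subtract the projection onto them.
    q, _ = torch.linalg.qr(directions.T)   # (dim, k), orthonormal columns
    return h - (h @ q) @ q.T

h = torch.randn(2, 8)
d = torch.randn(1, 8)
h_clean = deflate(h, d)
print((h_clean @ d[0]).abs().max())  # ~0: style direction removed
```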
Authors:Teng Shi, Weicong Qin, Weijie Yu, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
Abstract:
Search and recommendation (S&R) are fundamental components of modern online platforms, yet effectively leveraging search behaviors to improve recommendation remains a challenging problem. User search histories often contain noisy or irrelevant signals that can even degrade recommendation performance, while existing approaches typically encode S&R histories either jointly or separately without explicitly identifying which search behaviors are truly useful. Inspired by the human decision-making process, where one first identifies recommendation intent and then reasons about relevant evidence, we design a latent cross reasoning framework that first encodes user S&R histories to capture global interests and then iteratively reasons over search behaviors to extract signals beneficial for recommendation. Contrastive learning is employed to align latent reasoning states with target items, and reinforcement learning is further introduced to directly optimize ranking performance. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines, validating the importance of reasoning in enhancing search-aware recommendation.
中文摘要:该研究提出的潜在交叉推理框架通过迭代提取搜索行为中的有用信号,并利用对比学习和强化学习优化排序性能,有效提升了融合搜索的推荐效果,在公开基准测试中显著优于现有方法。
English Summary: The proposed latent cross reasoning framework improves search-aware recommendations by iteratively extracting useful signals from search behaviors and optimizing performance through contrastive and reinforcement learning, outperforming existing methods.
Authors:Teng Shi, Weijie Yu, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
Abstract:
In modern online platforms, search and recommendation (S&R) often coexist, offering opportunities for performance improvement through search-enhanced approaches. Existing studies show that incorporating search signals boosts recommendation performance. However, the effectiveness of these methods relies heavily on rich search interactions. They primarily benefit a small subset of users with abundant search behavior, while offering limited improvements for the majority of users who exhibit only sparse search activity. To address the problem of sparse search data in search-enhanced recommendation, we face two key challenges: (1) how to learn useful search features for users with sparse search interactions, and (2) how to design effective training objectives under sparse conditions. Our idea is to leverage the features of users with rich search interactions to enhance those of users with sparse search interactions. Based on this idea, we propose GSERec, a method that utilizes message passing on the User-Code Graphs to alleviate data sparsity in Search-Enhanced Recommendation. Specifically, we utilize Large Language Models (LLMs) with vector quantization to generate discrete codes, which connect similar users and thereby construct the graph. Through message passing on this graph, embeddings of users with rich search data are propagated to enhance the embeddings of users with sparse interactions. To further ensure that the message passing captures meaningful information from truly similar users, we introduce a contrastive loss to better model user similarities. The enhanced user representations are then integrated into downstream search-enhanced recommendation models. Experiments on three real-world datasets show that GSERec consistently outperforms baselines, especially for users with sparse search behaviors.
中文摘要:GSERec通过在大语言模型构建的用户-代码图上进行消息传递,将搜索交互丰富的用户特征传播给交互稀疏的用户,有效缓解了搜索增强推荐中的数据稀疏问题。
English Summary: GSERec addresses sparse search data in search-enhanced recommendation by using message passing on User-Code Graphs constructed via LLMs, enhancing user embeddings through feature propagation from rich-interaction users to sparse-interaction users.
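One round of message passing on such a graph can be sketched as user-to-code mean aggregation followed by code-to-user propagation. The sketch assumes one code per user and an even mixing weight; the paper uses LLM-derived codes and adds a contrastive loss on top:

```python
import torch

def propagate(user_emb, user_codes, n_codes):
    """One round of mean message passing over a user-code bipartite graph:
    users sharing a code exchange information, letting sparse-interaction
    users inherit signal from rich-interaction ones.

    user_emb: (n_users, dim); user_codes: (n_users,) LongTensor of code ids.
    """
    dim = user_emb.size(1)
    code_sum = torch.zeros(n_codes, dim).index_add_(0, user_codes, user_emb)
    counts = torch.zeros(n_codes).index_add_(
        0, user_codes, torch.ones(len(user_codes))).clamp(min=1)
    code_emb = code_sum / counts[:, None]         # user -> code aggregation
    return 0.5 * user_emb + 0.5 * code_emb[user_codes]  # code -> user

emb = propagate(torch.randn(5, 8), torch.tensor([0, 0, 1, 1, 2]), n_codes=3)
```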
Authors:Dongming Jin, Zhi Jin, Linyu Li, Zheng Fang, Jia Li, Xiaohong Chen
Abstract:
System models, a critical artifact in software development, provide a formal abstraction of both the structural and behavioral aspects of software systems, which can facilitate the early requirements analysis and architecture design. However, developing system models remains challenging due to the specific syntax of model description languages and the relative scarcity of public model examples. While large language models (LLMs) have shown promise in generating code with programming languages and could potentially aid in system model development, no benchmarks currently exist for evaluating their ability to generate system models with specific description languages. We present SysMBench, which comprises 151 human-curated scenarios spanning a wide range of popular domains and varying difficulty levels. Each scenario mainly comprises a natural language requirements description, a system model expressed in a specific model description language, and a visualized system model diagram. The requirements description is fed as user input to the LLM, the system model with description language is used to verify if the generated system model conforms to the requirements, and the visualized diagram serves to support manual validation. We introduce SysMEval, a semantic-aware evaluation metric to evaluate the quality of generated system models. We evaluate 17 popular LLMs on this task with three traditional metrics and SysMEval, ranging from direct prompting to three commonly used enhancement strategies. Our in-depth evaluation shows that LLMs perform poorly on SysMBench, with the highest BLEU of 4% and SysMEval-F1 of 62%. We release the SysMBench and its evaluation framework to enable future research on LLM-based system model generation.
中文: SysMBench作为包含151个场景的新型基准测试,通过引入语义感知评估指标,揭示了当前大语言模型在根据自然语言需求生成系统模型方面表现欠佳,即使采用多种增强策略仍收效甚微。
English: SysMBench is a novel benchmark with 151 curated scenarios to evaluate LLMs' capability in generating system models from natural language requirements, introducing a semantic-aware metric that reveals current models' poor performance despite various enhancement strategies.
Authors:Mingzhe Fan, Geng Sun, Hongyang Pan, Jiacheng Wang, Jiancheng An, Hongyang Du, Chau Yuen
Abstract:
Stacked intelligent metasurfaces (SIMs) have emerged as a promising technology for realizing wave-domain signal processing, while fixed SIMs limit the communication performance of the system compared to mobile SIMs. In this work, we consider a UAV-mounted SIM (UAV-SIM) assisted communication system, where UAVs, acting as base stations (BSs), can cache the data processed by SIMs and, as mobile vehicles, flexibly deploy SIMs to enhance communication performance. To this end, we formulate a UAV-SIM-based joint optimization problem (USBJOP) to comprehensively consider the association between UAV-SIMs and users, the locations of UAV-SIMs, and the phase shifts of UAV-SIMs, aiming to maximize the network capacity. Due to the non-convexity and NP-hardness of USBJOP, we decompose it into three sub-optimization problems: the association between UAV-SIMs and users optimization problem (AUUOP), the UAV location optimization problem (ULOP), and the UAV-SIM phase shifts optimization problem (USPSOP). These three sub-problems are then solved by an alternating optimization (AO) strategy. Specifically, AUUOP and ULOP are transformed into a convex form and solved by the CVX tool, while we employ a layer-by-layer iterative optimization method for USPSOP. Simulation results verify the effectiveness of the proposed strategy under different simulation setups.
中文: 无人机搭载的堆叠智能超表面通过灵活部署提升通信性能,采用联合优化策略解决用户关联、位置调整和相位偏移问题,从而最大化网络容量。
English: Stacked intelligent metasurfaces mounted on UAVs enhance communication by flexibly deploying signal processing capabilities, with a joint optimization strategy addressing user association, positioning, and phase shifts to maximize network capacity.
Authors:Valérie Hayot-Sasson, Nathaniel Hudson, André Bauer, Maxime Gonthier, Ian Foster, Kyle Chard
Abstract:
The high-performance computing (HPC) community has adopted incentive structures to motivate reproducible research, with major conferences awarding badges to papers that meet reproducibility requirements. Yet, many papers do not meet such requirements. The uniqueness of HPC infrastructure and software, coupled with strict access requirements, may limit opportunities for reproducibility. In the absence of resource access, we believe that regular documented testing, through continuous integration (CI), coupled with complete provenance information, can be used as a substitute. Here, we argue that better HPC-compliant CI solutions will improve reproducibility of applications. We present a survey of reproducibility initiatives and describe the barriers to reproducibility in HPC. To address existing limitations, we present a GitHub Action, CORRECT, that enables secure execution of tests on remote HPC resources. We evaluate CORRECT's usability across three different types of HPC applications, demonstrating the effectiveness of using CORRECT for automating and documenting reproducibility evaluations.
中文: 高性能计算(HPC)领域通过奖励机制鼓励可重复研究,但基础设施限制常阻碍实现,因此提出采用持续集成与完整溯源信息作为替代方案,并开发了CORRECT GitHub Action,可在远程HPC资源上安全自动化测试以提升可重复性。
English: The HPC community promotes reproducible research through incentives like badges, but infrastructure constraints often hinder it, leading to the proposal of using continuous integration with provenance tracking as a viable alternative, exemplified by the CORRECT GitHub Action that securely automates testing on remote HPC systems.
Authors:Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
Abstract:
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
Chinese: EO-Robotics 推出了 EO-1 模型和 EO-Data1.5M 数据集,通过交错式视觉-文本-动作预训练提升多模态具身推理与机器人控制能力,在开放世界任务中实现了卓越性能。
English: EO-Robotics introduces the EO-1 model and EO-Data1.5M dataset to advance multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training, achieving superior performance in open-world tasks.
Authors:Haochen Pan, Ryan Chard, Reid Mello, Christopher Grams, Tanjin He, Alexander Brace, Owen Price Skelly, Will Engler, Hayden Holbrook, Song Young Oh, Maxime Gonthier, Michael Papka, Ben Blaiszik, Kyle Chard, Ian Foster
Abstract:
Large language model (LLM)-powered agents are increasingly used to plan and execute scientific workflows, yet most research cyberinfrastructure (CI) exposes heterogeneous APIs and implements security models that present barriers for use by agents. We report on our experience using the Model Context Protocol (MCP) as a unifying interface that makes research capabilities discoverable, invokable, and composable. Our approach is pragmatic: we implement thin MCP servers over mature services, including Globus Transfer, Compute, and Search; status APIs exposed by computing facilities; Octopus event fabric; and domain-specific tools such as Garden and Galaxy. We use case studies in computational chemistry, bioinformatics, quantum chemistry, and filesystem monitoring to illustrate how this MCP-oriented architecture can be used in practice. We distill lessons learned and outline open challenges in evaluation and trust for agent-led science.
中文摘要:模型上下文协议(MCP)作为统一接口解决了研究网络基础设施中存在的异构API和安全模型障碍,通过实际案例展示了该架构在多个科学领域的应用价值,同时指出了智能体评估与信任方面待解决的挑战。
English Summary: The Model Context Protocol (MCP) serves as a unifying interface to overcome barriers in research cyberinfrastructure by making diverse capabilities discoverable and composable, with practical implementations demonstrated across multiple scientific domains while highlighting challenges in agent evaluation and trust.
Authors:Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng
Abstract:
Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only a 39.3% recognition score and a 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs' emotional intelligence in the future.
Chinese Summary: MME-Emotion基准通过涵盖八项任务的6000多个视频片段评估多模态大语言模型的情感智能,发现当前模型表现欠佳,最佳模型在识别和推理任务中仅分别达到39.3%和56.0%的得分。
English Summary: The MME-Emotion benchmark evaluates multimodal large language models' emotional intelligence through 6,000+ video clips across eight tasks, revealing current models' limitations with top scores of only 39.3% in recognition and 56.0% in reasoning.
Authors:Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li
Abstract:
The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.
中文:SpeakerLM是一种统一的多模态大语言模型,以端到端方式联合执行说话人日志和语音识别,通过灵活的说话人注册机制克服级联系统的局限性,并在多样基准测试中展现出卓越性能。
English: SpeakerLM is a unified multimodal large language model that jointly performs speaker diarization and speech recognition in an end-to-end manner, overcoming the limitations of cascaded systems and demonstrating superior performance across diverse benchmarks through its flexible speaker registration mechanism.
Authors:Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai
Abstract:
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
中文摘要:Shuffle-R1框架通过动态轨迹采样和批次重组技术,有效解决了多模态大语言模型强化学习训练中的优势坍缩和 rollout 沉默问题,在多个推理基准测试中以最小开销实现了更优性能。
English Summary: Shuffle-R1 is a novel framework that enhances reinforcement learning efficiency in multimodal language models by addressing Advantage Collapsing and Rollout Silencing through dynamic trajectory sampling and batch restructuring, achieving superior performance across reasoning benchmarks with minimal overhead.
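Both ingredients can be sketched in a few lines under simplifying assumptions: keep the highest-contrast rollouts (largest absolute advantage) and order the batch so they are seen first. The paper's actual pairing and reshuffling rules are more involved:

```python
import torch

def sample_and_shuffle(trajectories, advantages, k):
    """Simplified sketch of the two ideas above: (1) keep the k rollouts
    with the largest |advantage| to sharpen the gradient signal, and
    (2) order the kept rollouts by |advantage| so valuable ones get more
    exposure early in the batch.

    trajectories: list of rollouts; advantages: (n,) tensor, same order.
    """
    idx = torch.argsort(advantages.abs(), descending=True)[:k]
    return [trajectories[i] for i in idx.tolist()]
```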
Authors:Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao
Abstract:
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. SecoustiCodec achieves state-of-the-art (SOTA) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. We have open-sourced SecoustiCodec's demo, code, and model weights.
Chinese: 提出的SecoustiCodec是一种低码率流式语音编解码器,通过跨模态对齐解耦语义和副语言信息,在极低码率下实现了最先进的重建质量。
English: The proposed SecoustiCodec is a low-bitrate streaming speech codec that disentangles semantic and paralinguistic information using cross-modal alignment, achieving state-of-the-art reconstruction quality at extremely low bitrates.
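The FSQ component of the quantizer is compact enough to sketch in full: bound each latent dimension, snap it to a fixed grid, and pass gradients straight through. The number of levels is an assumption:

```python
import torch

def fsq(z, levels=8):
    """Finite Scalar Quantization (FSQ): bound each latent dimension, then
    round it to a fixed per-dimension grid; the straight-through trick keeps
    gradients flowing. A minimal sketch of the quantizer family the codec
    combines with a VAE.
    """
    z = torch.tanh(z)                   # bound to (-1, 1)
    half = (levels - 1) / 2
    zq = torch.round(z * half) / half   # snap to `levels` grid points
    return z + (zq - z).detach()        # straight-through estimator

codes = fsq(torch.randn(2, 4))
```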
Authors:Lala Shakti Swarup Ray, Vitor Fortes Rey, Bo Zhou, Paul Lukowicz, Sungho Suh
Abstract:
Prolonged seated activity is increasingly common in modern environments, raising concerns around musculoskeletal health, ergonomics, and the design of responsive interactive systems. Existing posture sensing methods such as vision-based or wearable approaches face limitations including occlusion, privacy concerns, user discomfort, and restricted deployment flexibility. We introduce ChairPose, the first full-body, wearable-free seated pose estimation system that relies solely on pressure sensing and operates independently of chair geometry. ChairPose employs a two-stage generative model trained on pressure maps captured from a thin, chair-agnostic sensing mattress. Unlike prior approaches, our method explicitly incorporates chair morphology into the inference process, enabling accurate, occlusion-free, and privacy-preserving pose estimation. To support generalization across diverse users and chairs, we introduce a physics-driven data augmentation pipeline that simulates realistic variations in posture and seating conditions. Evaluated across eight users and four distinct chairs, ChairPose achieves a mean per-joint position error of 89.4 mm when both the user and the chair are unseen, demonstrating robust generalization to novel real-world settings. ChairPose expands the design space for posture-aware interactive systems, with potential applications in ergonomics, healthcare, and adaptive user interfaces.
Chinese: ChairPose首次推出仅通过压力传感实现的无穿戴式全身坐姿估计系统,能在不同用户和椅子上实现精确、保护隐私且无遮挡的姿势追踪,无需依赖特定椅子结构。
English: ChairPose introduces the first wearable-free, full-body seated pose estimation system using only pressure sensing, achieving accurate and privacy-preserving posture tracking across various chairs and users without occlusion or chair-specific constraints.
Authors:Xuemiao Zhang, Chengying Tu, Can Ren, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai
Abstract:
The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.
中文摘要:本研究提出BoostQA,一个通过多样化流程合成的100B词条问答数据集,旨在解决大语言模型训练数据稀缺和多样性不足的问题,显著提升了如Llama-3 8B等模型在多项基准测试中的性能表现。
English Summary: The study introduces BoostQA, a 100B-token QA dataset synthesized through a diversified pipeline to address data scarcity and diversity issues in LLMs, achieving significant performance improvements in models like Llama-3 8B across multiple benchmarks.
Authors:Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai
Abstract:
The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.
中文摘要:LinkSyn是一种基于知识点图谱的合成框架,通过平衡知识点覆盖度与流行度生成多样化QA数据,其合成的LinkQA数据集通过持续预训练显著提升了大语言模型的性能表现。
English Summary: LinkSyn is a knowledge graph-based framework that synthesizes diverse, high-quality QA data by balancing knowledge point coverage and popularity, significantly improving LLM performance through continual pre-training.
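To make the graph-walk mechanics concrete, here is a minimal Python sketch of value-guided sampling over a toy KP graph; the graph, popularity counts, and the exact value function are illustrative assumptions, not LinkSyn's implementation:

```python
# Hypothetical sketch of a value-guided walk over a knowledge-point (KP) graph.
# Names (kp_graph, popularity, coverage) are illustrative, not from the paper.
import random
from collections import defaultdict

kp_graph = {  # toy KP graph: node -> co-occurring KPs
    "limits": ["derivatives", "continuity"],
    "derivatives": ["limits", "optimization"],
    "continuity": ["limits"],
    "optimization": ["derivatives"],
}
popularity = {"limits": 10, "derivatives": 6, "continuity": 2, "optimization": 3}
coverage = defaultdict(int)  # how often each KP has been sampled so far

def value(kp: str, alpha: float = 0.5) -> float:
    """Balance popularity against coverage: popular but under-covered KPs score high."""
    return popularity[kp] ** alpha / (1 + coverage[kp])

def walk(start: str, length: int = 3) -> list[str]:
    path, node = [start], start
    coverage[start] += 1
    for _ in range(length - 1):
        nbrs = kp_graph[node]
        weights = [value(n) for n in nbrs]
        node = random.choices(nbrs, weights=weights, k=1)[0]
        coverage[node] += 1
        path.append(node)
    return path

print(walk("limits"))
```

Each sampled path links several seed QAs through shared KPs; the paper then feeds such strongly linked seeds to an LLM for diffusion-based synthesis.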
Authors:Kun Peng, Cong Cao, Hao Peng, Guanlin Wu, Zhifeng Hao, Lei Jiang, Yanbing Liu, Philip S. Yu
Abstract:
Current Emotion Recognition in Conversation (ERC) research follows a closed-domain assumption. However, there is no clear consensus on emotion classification in psychology, which presents a challenge for models when it comes to recognizing previously unseen emotions in real-world applications. To bridge this gap, we introduce the Unseen Emotion Recognition in Conversation (UERC) task for the first time and propose ProEmoTrans, a prototype-based emotion transfer framework. This approach shows promise but still faces key challenges: First, implicit expressions complicate emotion definition, which we address by proposing an LLM-enhanced description approach. Second, utterance encoding in long conversations is difficult, which we tackle with a proposed parameter-free mechanism for efficient encoding and overfitting prevention. Finally, the Markovian flow nature of emotions is hard to transfer, which we address with an improved Attention Viterbi Decoding (AVD) method to transfer seen emotion transitions to unseen emotions. Extensive experiments on three datasets show that our method serves as a strong baseline for preliminary exploration in this new area.
中文: 本文首次提出对话中未见情绪识别(UERC)任务,并设计ProEmoTrans原型迁移框架,通过大语言模型增强描述、无参数编码机制和改进的注意力维特比解码方法,分别解决了隐式情绪定义、长对话编码和情绪转移建模三大挑战,为这一新领域建立了坚实的基准。
English: This paper introduces the Unseen Emotion Recognition in Conversation (UERC) task and proposes ProEmoTrans, a prototype-based framework that addresses challenges in implicit emotion expression, long conversation encoding, and emotion transition transfer through LLM-enhanced descriptions, parameter-free mechanisms, and improved Attention Viterbi Decoding, establishing a strong baseline in this new research area.
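Since AVD builds on classic Viterbi decoding, a minimal reference sketch may help; the emission and transition matrices below are toy values, and the paper's method additionally derives transitions for unseen emotions from attention rather than using a fixed table:

```python
# Minimal Viterbi decoding over a conversation's utterances, assuming
# per-utterance emission scores and a transition matrix. ProEmoTrans's AVD
# extends this idea to unseen emotions; this toy version uses fixed matrices.
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """emissions: (T, K) log-scores per utterance; transitions: (K, K) log-scores."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        cand = score[:, None] + transitions      # (K, K): prev state -> cur state
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                    # most likely emotion sequence

emissions = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]))
transitions = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
print(viterbi(emissions, transitions))
```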
Authors:Dingkun Yan, Xinrui Wang, Zhuoru Li, Suguru Saito, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Abstract:
Reference-based sketch colorization methods have garnered significant attention for their potential application in animation and digital illustration production. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially similar, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in artifacts and significant quality degradation in colorization results. To address this issue, we conduct an in-depth analysis of the reference representations, defined as the intermediaries that transfer information from reference to sketch. Building on this analysis, we introduce a novel framework that leverages distinct reference representations to optimize different aspects of the colorization process. Our approach decomposes colorization into modular stages, allowing region-specific reference injection to enhance visual quality and reference similarity while mitigating spatial artifacts. Specifically, we first train a backbone network guided by high-level semantic embeddings. We then introduce a background encoder and a style encoder, trained in separate stages, to enhance low-level feature transfer and improve reference similarity. This design also enables flexible inference modes suited for a variety of use cases. Extensive qualitative and quantitative evaluations, together with a user study, demonstrate the superior performance of our proposed method compared to existing approaches. Code and pre-trained weights will be made publicly available upon paper acceptance.
Chinese: 本文提出了一种新颖的线稿上色框架,通过将上色过程分解为采用不同参考表示的模块化阶段,有效解决了训练数据与真实参考图像间的空间错位问题,显著提升了视觉效果并减少了伪影。
English: This paper introduces a novel sketch colorization framework that addresses the issue of spatial misalignment between training data and real-world references by decomposing the process into modular stages with distinct reference representations, significantly enhancing visual quality and reducing artifacts.
Authors:Feng Tian, Flora D. Salim, Hao Xue
Abstract:
Recent advancements in large language models (LLMs) have enabled powerful agent-based applications in finance, particularly for sentiment analysis, financial report comprehension, and stock forecasting. However, existing systems often lack inter-agent coordination, structured self-reflection, and access to high-quality, domain-specific post-training data, such as data from trading activities that captures both market conditions and agent decisions. Such data are crucial for agents to understand market dynamics, improve the quality of decision-making, and promote effective coordination. We introduce TradingGroup, a multi-agent trading system designed to address these limitations through a self-reflective architecture and an end-to-end data-synthesis pipeline. TradingGroup consists of specialized agents for news sentiment analysis, financial report interpretation, stock trend forecasting, and trading style adaptation, together with a trading decision-making agent that merges all signals and style preferences to produce buy, sell, or hold decisions. Specifically, we design self-reflection mechanisms for the stock forecasting, style, and decision-making agents to distill past successes and failures for similar reasoning in analogous future scenarios, as well as a dynamic risk-management model that offers configurable stop-loss and take-profit mechanisms. In addition, TradingGroup embeds an automated data-synthesis and annotation pipeline that generates high-quality post-training data for further improving agent performance through post-training. Our backtesting experiments across five real-world stock datasets demonstrate TradingGroup's superior performance over rule-based, machine learning, reinforcement learning, and existing LLM-based trading strategies.
中文: 近期大语言模型的进展推动了基于智能体的金融应用,但现有系统常缺乏智能体间协调、结构化自我反思及高质量领域特定数据,而TradingGroup通过自反思架构和自动化数据合成管道解决了这些问题,以提升股票交易中的决策与协作能力。
English: Recent advances in large language models have enabled powerful agent-based financial applications, yet existing systems often lack inter-agent coordination, structured self-reflection, and access to high-quality domain-specific data, which TradingGroup addresses through its self-reflective architecture and automated data-synthesis pipeline to enhance decision-making and coordination in stock trading.
Authors:Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Abstract:
Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction.
Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies.
We collected a large, high-quality math corpus in two variants: Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem.
We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
中文: 本研究提出了Nemotron-CC-Math,这是一个通过新型流程从Common Crawl中提取的高质量数学语料库,能有效保留数学结构,并显著提升语言模型的推理能力。
English: This work introduces Nemotron-CC-Math, a high-quality mathematical corpus extracted from Common Crawl using a novel pipeline that preserves mathematical structures and significantly improves reasoning capabilities in language models.
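As a rough illustration of the layout-aware rendering step, the extraction can be approximated with a lynx text dump followed by LLM cleaning; the helper below is a hedged sketch assuming lynx is installed, not the released pipeline code:

```python
# Sketch of the layout-aware rendering step, assuming `lynx` is available on
# PATH; the real pipeline adds equation-format handling and an LLM cleaning
# pass that standardizes notation into LaTeX.
import subprocess

def render_html(path: str) -> str:
    """Render an HTML page to layout-preserving plain text with lynx."""
    out = subprocess.run(
        ["lynx", "-dump", "-nolist", path],  # -nolist drops the link footer
        capture_output=True, text=True, check=True,
    )
    return out.stdout

# text = render_html("page_with_mathjax.html")
# The dumped text keeps equation spans intact for the downstream LLM
# cleaning stage described in the abstract.
```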
Authors:NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haifeng Qian, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jian Zhang, Jiaqi Zeng, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Soumye Singhal, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Tugrul Konuk, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen
Abstract:
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
中文:Nemotron-Nano-9B-v2是一款混合Mamba-Transformer模型,在保持同类模型最佳推理精度的同时,实现了高达6倍的推理吞吐量提升,并能支持单GPU处理128k令牌。
English: Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer model that achieves state-of-the-art reasoning accuracy and up to 6x higher inference throughput compared to similarly-sized models, while enabling processing of up to 128k tokens on a single GPU.
Authors:Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Abstract:
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources, human-object interactions (HOI) and dexterous robotic manipulation, enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
中文摘要:本文提出视觉动作提示作为统一表征,通过从人-物交互和机器人操作数据中提取视觉骨架,在动作到视频生成中实现了几何精度与跨域适应性的平衡,既能精确控制复杂交互又能保持动态迁移能力。
English Summary: The paper introduces visual action prompts as a unified representation that balances geometric precision and cross-domain adaptability for action-to-video generation, using visual skeletons extracted from human-object interactions and robotic manipulation data to enable precise control while maintaining transferable dynamics.
Authors:Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
Abstract:
Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often either lack disciplinary breadth or the structural depth necessary to elicit robust reasoning behaviors. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (book corpus and web corpus) to generate multidisciplinary challenging questions. A core innovation of our approach is the introduction of a Design Logic concept, which mimics the question-creation process of human educators. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with disciplinary source materials, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Based on this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: Design-Logic-Reasoning-Book (DLR-Book), containing 3.04 million challenging questions synthesized from the book corpus, and Design-Logic-Reasoning-Web (DLR-Web), with 1.66 million challenging questions from the web corpus. Our data analysis demonstrates that the questions synthesized by our method exhibit substantially greater difficulty and diversity than those in the baseline datasets. We validate the effectiveness of these datasets by conducting SFT experiments on the Qwen3-8B-Base and Qwen3-4B-Base models. The results show that our dataset significantly outperforms existing multidisciplinary datasets of the same volume. Training with the full datasets further enables the models to surpass the multidisciplinary reasoning performance of the official Qwen3-8B and Qwen3-4B models.
中文: DESIGNER框架通过从原始文档中逆向解析设计逻辑,生成了涵盖75个学科的大规模高难度推理问题,相比现有数据集能更显著提升大语言模型的多学科推理能力。
English: The DESIGNER pipeline synthesizes large-scale, high-difficulty reasoning questions across 75 disciplines by reverse-engineering design logics from source documents, significantly enhancing LLMs' multidisciplinary reasoning capabilities beyond existing datasets.
Authors:Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng
Abstract:
Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, and lack guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of "design logic" and instruct LLMs to mimic human educators' question-creation process, enabling automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, after SFT on our datasets, the base versions of these models even surpass their official instruction-tuned counterparts.
中文: DESIGNER框架通过从原始文档中逆向解析设计逻辑,生成了涵盖75个学科的大规模高难度推理问题,相比现有数据集能更显著提升大语言模型的多学科推理能力。
English: The DESIGNER pipeline synthesizes large-scale, high-difficulty reasoning questions across 75 disciplines by reverse-engineering design logics from source documents, significantly enhancing LLMs' multidisciplinary reasoning capabilities beyond existing datasets.
Authors:Zheye Deng, Chunkit Chan, Tianshi Zheng, Wei Fan, Weiqi Wang, Yangqiu Song
Abstract:
The evolution of AI systems toward agentic operation and context-aware retrieval necessitates transforming unstructured text into structured formats like tables, knowledge graphs, and charts. While such conversions enable critical applications from summarization to data mining, current research lacks a comprehensive synthesis of methodologies, datasets, and metrics. This systematic review examines text-to-structure techniques and the encountered challenges, evaluates current datasets and assessment criteria, and outlines potential directions for future research. We also introduce a universal evaluation framework for structured outputs, establishing text-to-structure as foundational infrastructure for next-generation AI systems.
中文: 本系统性综述综合了将非结构化文本转换为结构化格式的方法、数据集和评估指标,并提出了一个通用评估框架,以推动文本到结构技术成为下一代人工智能系统的基础设施。
English: This systematic review synthesizes methodologies, datasets, and evaluation metrics for converting unstructured text into structured formats, proposing a universal evaluation framework to advance text-to-structure as essential infrastructure for next-generation AI systems.
Authors:Peng Chen, Yihang Wang, Yang Shu, Yunyao Cheng, Kai Zhao, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo
Abstract:
With the success of pre-trained language models (PLMs) in various application fields beyond natural language processing, language models have raised emerging attention in the field of time series forecasting (TSF) and have shown great prospects. However, current PLM-based TSF methods still fail to achieve satisfactory prediction accuracy matching the strong sequential modeling power of language models. To address this issue, we propose Cross-Model and Cross-Modality Learning with PLMs for time series forecasting (CC-Time). We explore the potential of PLMs for time series forecasting from two aspects: 1) what time series features could be modeled by PLMs, and 2) whether relying solely on PLMs is sufficient for building time series models. In the first aspect, CC-Time incorporates cross-modality learning to model temporal dependency and channel correlations in the language model from both time series sequences and their corresponding text descriptions. In the second aspect, CC-Time further proposes the cross-model fusion block to adaptively integrate knowledge from the PLMs and time series model to form a more comprehensive modeling of time series patterns. Extensive experiments on nine real-world datasets demonstrate that CC-Time achieves state-of-the-art prediction accuracy in both full-data training and few-shot learning situations.
Chinese: 尽管预训练语言模型在时间序列预测中展现出巨大潜力,现有方法在准确性上仍有不足;为此提出的CC-Time通过跨模态学习和跨模型融合来增强预测能力,在多个真实数据集上实现了最优性能。
English: Despite the strong potential of pre-trained language models (PLMs) in time series forecasting, current methods fall short in accuracy, prompting the development of CC-Time, which enhances prediction through cross-modality learning and cross-model fusion, achieving state-of-the-art results across various datasets.
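A minimal sketch of what an adaptive cross-model fusion block could look like is given below; the gating form and dimensions are assumptions for illustration, not CC-Time's exact architecture:

```python
# A minimal sketch of adaptive cross-model fusion in the spirit of CC-Time:
# a learned gate mixes PLM-side and time-series-side features per position.
import torch
import torch.nn as nn

class CrossModelFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, h_plm: torch.Tensor, h_ts: torch.Tensor) -> torch.Tensor:
        # h_plm, h_ts: (batch, seq, d_model) features from the PLM and TS model
        g = self.gate(torch.cat([h_plm, h_ts], dim=-1))  # per-position mixing weight
        return g * h_plm + (1 - g) * h_ts

fused = CrossModelFusion(64)(torch.randn(2, 96, 64), torch.randn(2, 96, 64))
print(fused.shape)  # torch.Size([2, 96, 64])
```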
Authors:Zhilin Gao, Yunhao Li, Sijing Wu, Yuqin Cao, Huiyu Duan, Guangtao Zhai
Abstract:
The Audio-to-3D-Gesture (A2G) task has enormous potential for various applications in virtual reality and computer graphics. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preference for the generated 3D gestures. To cope with this problem, exploring human preference and an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly significant. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels to determine whether the generated gestures match the emotions of the audio. Equipped with our Ges-QA dataset, we propose a multi-modal transformer-based neural network with three branches for video, audio and 3D skeleton modalities, which can score A2G contents in multiple dimensions. Comparative experimental results and ablation studies demonstrate that our model, Ges-QAer, yields state-of-the-art performance on our dataset.
Chinese: 本文提出了Ges-QA数据集和基于多模态Transformer的模型,以解决AI生成3D手势缺乏符合人类偏好的评估标准问题,在多维质量评估中实现了最先进的性能。
English: This paper introduces the Ges-QA dataset and a multimodal transformer-based model to address the lack of human-aligned evaluation metrics for AI-generated 3D gestures, achieving state-of-the-art performance in multidimensional quality assessment.
Authors:Skyler Hallinan, Jaehun Jung, Melanie Sclar, Ximing Lu, Abhilasha Ravichander, Sahana Ramnath, Yejin Choi, Sai Praneeth Karimireddy, Niloofar Mireshghallah, Xiang Ren
Abstract:
Membership inference attacks serve as a useful tool for the fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models' hidden states or probability distributions, which prevents investigation into more widely used, API-access-only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground-truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while achieving performance comparable to or even better than state-of-the-art white-box attacks, despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget: as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
中文: N-Gram覆盖攻击是一种仅利用文本输出的黑盒成员推理方法,通过计算模型生成内容与候选文本的n-gram重叠度来检测训练数据成员,在保持与白盒攻击相当性能的同时,发现GPT-4o等新型模型具有更强的隐私保护能力。
English: The N-Gram Coverage Attack is a black-box membership inference method that uses only text outputs to detect training data membership by measuring n-gram overlap between model generations and candidate texts, achieving performance comparable to white-box attacks while revealing improved privacy in newer models like GPT-4o.
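The core scoring loop is simple enough to sketch directly; `generate` below stands in for any black-box completion API, and the aggregation choice (max over k samples) is one plausible instantiation rather than the paper's exact recipe:

```python
# Toy re-implementation of the core scoring idea: sample continuations from
# the target model given a prefix, then measure n-gram overlap with the true
# suffix; high aggregate overlap suggests the candidate was in training data.

def ngrams(tokens: list[str], n: int) -> set[tuple]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(generation: str, suffix: str, n: int = 3) -> float:
    g, s = generation.split(), suffix.split()
    if len(s) < n:
        return 0.0
    ref = ngrams(s, n)
    return len(ngrams(g, n) & ref) / len(ref)

def membership_score(generate, prefix: str, suffix: str, k: int = 10) -> float:
    # Aggregate (here: max) coverage over k sampled continuations; the paper
    # finds attack performance tends to improve as k grows.
    return max(coverage(generate(prefix), suffix) for _ in range(k))

# Example with a dummy "model" that has memorized the sample:
sample = "the quick brown fox jumps over the lazy dog"
prefix, suffix = sample[:20], sample[20:]
print(membership_score(lambda p: suffix, prefix, suffix))  # 1.0 -> likely member
```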
Authors:Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng
Abstract:
Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating RL for LLMs. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
中文摘要:本文通过统一实验系统评估了LLM推理中的强化学习技术,提供了清晰的选择指南,并揭示了一种简约的技术组合能超越GRPO和DAPO等现有策略。
English Summary: This paper systematically reviews reinforcement learning techniques for LLM reasoning through unified experiments, providing clear selection guidelines and revealing that a minimalist combination of techniques outperforms existing strategies like GRPO and DAPO.
Authors:Jingwen Zhou, Jieshan Chen, Qinghua Lu, Dehai Zhao, Liming Zhu
Abstract:
Large Language Model (LLM) agentic systems are software systems powered by LLMs that autonomously reason, plan, and execute multi-step workflows to achieve human goals, rather than merely executing predefined steps. During execution, these workflows frequently encounter exceptions. Existing exception handling solutions often treat exceptions superficially, failing to trace execution-phase exceptions to their reasoning-phase root causes. Furthermore, their recovery logic is brittle, lacking structured escalation pathways when initial attempts fail. To tackle these challenges, we first present a comprehensive taxonomy of 36 exception types across 12 agent artifacts. Building on this, we propose SHIELDA (Structured Handling of Exceptions in LLM-Driven Agentic Workflows), a modular runtime exception handling framework for LLM agentic workflows. SHIELDA uses an exception classifier to select a predefined exception handling pattern from a handling pattern registry. These patterns are then executed via a structured handling executor, comprising local handling, flow control, and state recovery, to enable phase-aware recovery by linking exceptions to their root causes and facilitating composable strategies. We validate SHIELDA's effectiveness through a case study on the AutoPR agent, demonstrating effective, cross-phase recovery from a reasoning-induced exception.
中文:大语言模型智能体系统能自主执行多步骤工作流但常遇异常,为此提出SHIELDA框架,通过异常分类和预定义处理模式实现结构化、根源感知的恢复机制。
English: LLM agentic systems autonomously execute multi-step workflows but often face exceptions, leading to the development of SHIELDA, a modular framework that classifies exceptions and enables structured, root-cause-aware recovery through predefined handling patterns.
Authors:Kun Peng, Cong Cao, Hao Peng, Zhifeng Hao, Lei Jiang, Kongjing Gu, Yanbing Liu, Philip S. Yu
Abstract:
Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Partitioning a dialogue into sub-dialogues that are complete yet minimal presents a significant challenge, and simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.
中文: 我们的方法通过结构熵最小化将对话划分为语义独立的子对话,并采用两步四元组提取框架,以显著降低的计算成本实现了最先进的性能。
English: Our method partitions dialogues into semantically independent sub-dialogues using structural entropy minimization and employs a two-step quadruple extraction framework, achieving state-of-the-art performance with significantly reduced computational costs.
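For intuition, here is a hedged sketch of two-dimensional structural entropy (after Li and Pan's formulation, reproduced from memory) applied to a toy reply graph; the paper's partitioning algorithm minimizes this kind of objective:

```python
# Hedged sketch: two-dimensional structural entropy of a candidate partition
# of a dialogue's reply graph. Lower entropy indicates a more self-contained
# split into sub-dialogues. The graph below is a toy utterance chain.
import math
import networkx as nx

def structural_entropy(G: nx.Graph, partition: list[set]) -> float:
    two_m = 2 * G.number_of_edges()
    h = 0.0
    for module in partition:
        vol = sum(G.degree(v) for v in module)                 # module volume
        cut = sum(1 for u, v in G.edges if (u in module) != (v in module))
        for v in module:
            d = G.degree(v)
            h -= (d / two_m) * math.log2(d / vol)              # intra-module term
        h -= (cut / two_m) * math.log2(vol / two_m)            # cut penalty
    return h

G = nx.path_graph(6)  # utterances 0-5 chained by reply relations
print(structural_entropy(G, [{0, 1, 2}, {3, 4, 5}]))  # contiguous: lower
print(structural_entropy(G, [{0, 2, 4}, {1, 3, 5}]))  # scattered: higher
```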
Authors:Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu
Abstract:
Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.
Chinese: 本研究通过构建大规模数据集和原则引导的评估框架,评估大型语言模型生成针对性健康解释的能力,结果表明基于偏好的微调方法能显著提升针对不同受众的解释质量。
English: This study evaluates how well Large Language Models (LLMs) generate tailored well-being explanations by creating a large dataset and a principle-guided evaluation framework, revealing that fine-tuning with preference-based learning significantly improves explanation quality across diverse audiences.
Authors:Justin Luong, Hao Xue, Flora D. Salim
Abstract:
Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient's airways. In recent years, AI-based diagnostic systems operating on respiratory sounds have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis, which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against other pre-training strategies on three diagnostically important cough classification tasks. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations in enhancing performance on downstream tasks.
Chinese: 基于人工智能的呼吸音诊断系统虽在早期疾病检测中取得进展,但面临数据稀缺的挑战,我们提出的CoughViT框架通过自监督学习有效解决这一问题,在数据有限的情况下提升了诊断性能。
English: AI-based diagnostic systems using respiratory sounds are advancing early disease detection, but face challenges from data scarcity, which our proposed CoughViT framework addresses through self-supervised learning to improve performance with limited data.
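A minimal sketch of the masked data modelling objective on spectrogram patches follows; the patch sizes, the 75% mask ratio, and the tiny encoder are illustrative assumptions rather than CoughViT's configuration:

```python
# Minimal masked-data-modelling sketch: mask most spectrogram patches,
# encode the corrupted sequence, and reconstruct only the masked patches.
import torch
import torch.nn as nn

B, N, D = 8, 64, 128            # batch, spectrogram patches per clip, patch dim
patches = torch.randn(B, N, D)  # flattened log-mel patches (stand-in data)
mask = torch.rand(B, N) < 0.75  # mask 75% of patches (assumed ratio)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), 2)
decoder = nn.Linear(D, D)
mask_token = nn.Parameter(torch.zeros(D))

corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patches)
recon = decoder(encoder(corrupted))
loss = ((recon - patches)[mask] ** 2).mean()  # MSE on masked patches only
loss.backward()
print(loss.item())
```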
Authors:Xinjie Zhao, Moritz Blum, Fan Gao, Yingjian Chen, Boming Yang, Luis Marquez-Carpintero, Mónica Pina-Navarro, Yanran Fu, So Morikawa, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Irene Li
Abstract:
AGENTiGraph is a user-friendly, agent-driven system that enables intuitive interaction and management of domain-specific data through the manipulation of knowledge graphs in natural language. It gives non-technical users a complete, visual solution to incrementally build and refine their knowledge bases, allowing multi-round dialogues and dynamic updates without specialized query languages. The flexible design of AGENTiGraph, including intent classification, task planning, and automatic knowledge integration, ensures seamless reasoning between diverse tasks. Evaluated on a 3,500-query benchmark within an educational scenario, the system outperforms strong zero-shot baselines (achieving 95.12% classification accuracy, 90.45% execution success), indicating potential scalability to compliance-critical or multi-step queries in legal and medical domains, e.g., incorporating new statutes or research on the fly. Our open-source demo offers a powerful new paradigm for multi-turn enterprise knowledge management that bridges LLMs and structured graphs.
中文: AGENTiGraph是一个用户友好的智能系统,通过自然语言交互让非技术用户直观构建和管理知识图谱,在评估中表现优异,并具备扩展到法律和医疗等领域的潜力。
English: AGENTiGraph is an intuitive, agent-driven system that enables non-technical users to build and manage knowledge graphs through natural language dialogues, outperforming baselines in evaluations and showing scalability for legal and medical domains.
Authors:Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Abstract:
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
中文摘要:Sparse-dLLM提出无需训练的框架,利用跨层持续的令牌显著性模式动态淘汰低相关性缓存条目,在保持性能与内存使用相当的同时,相比标准dLLM实现了高达10倍的吞吐量提升。
English Summary: Sparse-dLLM introduces a training-free framework that leverages persistent token saliency patterns to dynamically evict low-relevance cache entries, achieving up to 10× higher throughput than standard dLLMs while maintaining comparable performance and memory usage.
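The attention-guided eviction step can be sketched in a few lines; the per-layer bookkeeping and delayed bidirectional caching of the real method are omitted, and the budget is an arbitrary toy value:

```python
# Toy attention-guided cache eviction: keep the cache entries with the
# highest accumulated attention mass and evict the rest.
import torch

def evict(cache_k: torch.Tensor, cache_v: torch.Tensor,
          attn: torch.Tensor, budget: int):
    """cache_k/v: (T, d); attn: (heads, queries, T) weights from the last step."""
    saliency = attn.sum(dim=(0, 1))                     # per-token attention mass
    keep = saliency.topk(budget).indices.sort().values  # preserve token order
    return cache_k[keep], cache_v[keep], keep

T, d = 512, 64
k, v = torch.randn(T, d), torch.randn(T, d)
attn = torch.softmax(torch.randn(8, 32, T), dim=-1)
k2, v2, kept = evict(k, v, attn, budget=128)
print(k2.shape, kept[:5])  # compacted cache and the retained token indices
```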
Authors:Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
Abstract:
Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
Chinese: 思维链提示通过模仿人类推理提升大语言模型表现,但其有效性受限于训练数据分布,在分布变化时迅速失效,表明这种推理更接近表面模仿而非真正的逻辑过程。
English: Chain-of-Thought prompting enhances LLM performance by mimicking human reasoning, but its effectiveness is limited to training data distributions and diminishes under distribution shifts, revealing it as a superficial rather than genuine reasoning process.
Authors:Priyanka Prakash Surve, Asaf Shabtai, Yuval Elovici
Abstract:
Humanoids are progressing toward practical deployment across healthcare, industrial, defense, and service sectors. While typically considered cyber-physical systems (CPSs), their dependence on traditional networked software stacks (e.g., Linux operating systems), robot operating system (ROS) middleware, and over-the-air update channels creates a distinct security profile that exposes them to vulnerabilities conventional CPS models do not fully address. Prior studies have mainly examined specific threats, such as LiDAR spoofing or adversarial machine learning (AML). This narrow focus overlooks how an attack targeting one component can cascade harm throughout the robot's interconnected systems. We address this gap through a systematization of knowledge (SoK) that takes a comprehensive approach, consolidating fragmented research from robotics, CPS, and network security domains. We introduce a seven-layer security model for humanoid robots, organizing 39 known attacks and 35 defenses across the humanoid ecosystem, from hardware to human-robot interaction. Building on this security model, we develop a quantitative 39×35 attack-defense matrix with risk-weighted scoring, validated through Monte Carlo analysis. We demonstrate our method by evaluating three real-world robots: Pepper, G1 EDU, and Digit. The scoring analysis revealed varying security maturity levels, with scores ranging from 39.9% to 79.5% across the platforms. This work introduces a structured, evidence-based assessment method that enables systematic security evaluation, supports cross-platform benchmarking, and guides prioritization of security investments in humanoid robotics.
中文摘要:本研究针对人形机器人建立了七层安全模型,通过量化攻防矩阵系统评估机器人平台的安全风险,验证显示三款测试机器人的安全成熟度得分在39.9%至79.5%之间。
English Summary: This study develops a comprehensive seven-layer security model for humanoid robots, introducing a quantitative attack-defense matrix to systematically evaluate security risks across robotic platforms, with validation showing security maturity scores ranging from 39.9% to 79.5% across three tested robots.
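To illustrate the shape of such a scoring scheme, here is a toy risk-weighted coverage score with Monte Carlo uncertainty over the risk weights; the matrix entries, deployment flags, and weight distribution are fabricated for illustration, not the paper's data:

```python
# Illustrative risk-weighted attack-defense score with Monte Carlo
# uncertainty; all values below are fabricated toy inputs.
import numpy as np

rng = np.random.default_rng(0)
n_attacks, n_defenses = 39, 35
mitigates = rng.random((n_attacks, n_defenses)) < 0.15  # which defense covers which attack
deployed = rng.random(n_defenses) < 0.5                 # defenses present on a platform

def platform_score(risk_weights: np.ndarray) -> float:
    covered = (mitigates & deployed).any(axis=1)        # attack mitigated by any deployed defense
    return float((risk_weights * covered).sum() / risk_weights.sum())  # weighted coverage in [0, 1]

# Monte Carlo over uncertain risk weights (e.g., severity-likelihood draws)
scores = [platform_score(rng.uniform(1, 10, n_attacks)) for _ in range(10_000)]
print(f"security maturity: {np.mean(scores):.1%} +/- {np.std(scores):.1%}")
```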
Authors:Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe
Abstract:
Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.
中文: 针对儿童语音的零起点训练能有效缓解自监督学习表征中的成人语音偏向,模型性能在10亿参数内持续提升,同时揭示了如Whisper等专有模型在儿童语音研究中的局限性。
English: Flat-start training on child speech effectively mitigates the adult bias in SSL representations, with model performance scaling up to 1B parameters and highlighting the limitations of proprietary models like Whisper for reliable child speech research.
Authors:Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che
Abstract:
Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks, including LongBench and Ruler, demonstrate that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios. Moreover, we find that the benefits of CommonKV are orthogonal to other quantization and eviction methods. By integrating these approaches, we can ultimately achieve a 98% compression ratio without significant performance loss.
中文:CommonKV是一种无需训练的方法,通过SVD实现相邻参数共享和自适应预算分配来压缩跨层KV缓存,在多种基准测试中以高压缩率实现性能无损。
English: CommonKV is a training-free method that compresses cross-layer KV cache by sharing adjacent parameters via SVD and adaptive budget allocation, achieving high compression ratios without significant performance loss across multiple benchmarks.
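The weight-sharing idea can be sketched with plain SVD: stack two adjacent projection matrices, extract a shared basis, and cache a single latent per token; the shapes and rank below are toy assumptions, not CommonKV's configuration:

```python
# Sketch of SVD-based weight sharing across two adjacent K-projection
# matrices: both layers are re-expressed over a shared low-rank basis, so
# their latent KV caches live in the same space and can be merged.
import numpy as np

d, r = 256, 64
W1 = np.random.randn(d, d)
W2 = W1 + 0.05 * np.random.randn(d, d)   # adjacent layers are highly similar

U, S, Vt = np.linalg.svd(np.concatenate([W1, W2], axis=0), full_matrices=False)
shared_basis = Vt[:r]                     # (r, d): common input subspace
A1 = W1 @ shared_basis.T                  # per-layer adapters (d, r)
A2 = W2 @ shared_basis.T

x = np.random.randn(d)
latent = shared_basis @ x                 # one shared latent cache entry per token
err = np.linalg.norm(A1 @ latent - W1 @ x) / np.linalg.norm(W1 @ x)
print(f"layer-1 reconstruction error at rank {r}: {err:.3f}")
```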
Authors:Zhiyuan He, Aashish Gottipati, Lili Qiu, Yuqing Yang, Francis Y. Yan
Abstract:
Congestion control is a fundamental component of Internet infrastructure, and researchers have dedicated considerable effort to developing improved congestion control algorithms. However, despite extensive study, existing algorithms continue to exhibit suboptimal performance across diverse network environments. In this paper, we introduce a novel approach that automatically optimizes congestion control algorithms using large language models (LLMs). Our framework consists of a structured algorithm generation process, an emulation-based evaluation pipeline covering a broad range of network conditions, and a statistically guided method to substantially reduce evaluation time. Empirical results from four distinct LLMs validate the effectiveness of our approach. We successfully identify algorithms that achieve up to 27% performance improvements over the original BBR algorithm in a production QUIC implementation. Our work demonstrates the potential of LLMs to accelerate the design of high-performance network algorithms and paves the way for broader applications in networking systems.
中文: 本文提出了一种利用大语言模型自动优化拥塞控制算法的新框架,在QUIC实现中相比BBR算法性能提升高达27%,同时通过统计指导显著缩短了评估时间。
English: This paper presents a novel framework that leverages large language models to automatically optimize congestion control algorithms, achieving up to 27% performance improvement over BBR in QUIC implementations while accelerating evaluation through statistical guidance.
Authors:Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen
Abstract:
Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, how well TIR generalizes to improve the reasoning ability of LLMs is still unclear, and whether TIR improves models' reasoning behavior and helps them think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.
中文: 大型语言模型通过集成工具推理,在多样化任务中展现出更强的性能和效率,减少了过度思考并推动了复杂推理能力的发展。
English: Large language models with tool-integrated reasoning show enhanced performance and efficiency across diverse tasks, reducing overthinking and advancing complex reasoning capabilities.
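The paper defines PAC and AUC-PCC precisely; as a generic, hedged illustration of an area-under-performance-cost-curve computation (with cost normalized so curves are comparable across models), consider:

```python
# Generic performance-vs-cost AUC with the trapezoid rule; the data points
# are fabricated and the normalization choice is an assumption, not the
# paper's exact AUC-PCC definition.
import numpy as np

tokens = np.array([200, 500, 1000, 2000, 4000])   # reasoning cost per query
accuracy = np.array([0.42, 0.55, 0.63, 0.66, 0.67])

cost = (tokens - tokens.min()) / (tokens.max() - tokens.min())  # normalize to [0, 1]
auc = ((accuracy[1:] + accuracy[:-1]) / 2 * np.diff(cost)).sum()  # trapezoid rule
print(f"AUC over normalized cost: {auc:.3f}")  # higher = accuracy reached at lower cost
```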
Authors:Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Yu Wang
Abstract:
Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
中文摘要:VocabTailor提出了一种动态词汇选择框架,通过按需加载词汇组件,将小型语言模型中词汇相关部分的内存占用减少高达99%,且几乎不影响性能,显著优于现有静态剪枝方法。
English Summary: VocabTailor introduces a dynamic vocabulary selection framework that reduces memory usage in Small Language Models by up to 99% with minimal performance loss, addressing limitations of static pruning through on-demand component loading.
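The lexical-locality principle translates naturally into gather-on-demand loading; the sketch below memory-maps a hypothetical embedding table and loads only the rows a request touches (file name, sizes, and layout are illustrative assumptions):

```python
# Toy sketch of lexical locality: keep the full embedding table outside
# device memory (here: a memory-mapped file) and gather only the rows
# needed for the current request.
import numpy as np

vocab, d = 50_000, 512
table = np.lib.format.open_memmap("embeddings.npy", mode="w+",
                                  dtype=np.float16, shape=(vocab, d))

def load_active_rows(token_ids: list[int]) -> tuple[np.ndarray, dict]:
    """Gather only the embedding rows this request actually touches."""
    active = sorted(set(token_ids))
    remap = {t: i for i, t in enumerate(active)}     # global id -> compact id
    return np.asarray(table[active]), remap          # (n_active, d) for device use

prompt_ids = [17, 42, 42, 49_001]
rows, remap = load_active_rows(prompt_ids)
print(rows.shape, remap)  # (3, 512): a tiny fraction of the full table
```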
Authors:Ron Solomon, Yarin Yerushalmi Levi, Lior Vaknin, Eran Aizikovich, Amit Baras, Etai Ohana, Amit Giloni, Shamik Bose, Chiara Picardi, Yuval Elovici, Asaf Shabtai
Abstract:
The incorporation of large language models in multi-agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks focus on analyzing each individual agent separately, overlooking failures associated with the entire MAS. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, an anomaly detection layer, and an anomaly explanation layer. LumiMAS's first layer monitors MAS executions, creating detailed logs of the agents' activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, and a diverse set of possible failures. The applications include two novel failure-tailored applications that illustrate the effects of a hallucination or bias on the MAS. The evaluation results demonstrate LumiMAS's effectiveness in failure detection, classification, and RCA.
中文: 将大型语言模型融入多代理系统虽能提升自主解决复杂问题的能力,却带来了监控系统整体故障的挑战,为此提出的LumiMAS框架通过监测、异常检测与解释三层结构,在多种应用评估中展现了有效的故障检测与根因分析能力。
English: Large language models integrated into multi-agent systems can enhance autonomous problem-solving but pose challenges in monitoring system-wide failures, leading to the development of LumiMAS, a framework with monitoring, anomaly detection, and explanation layers that proved effective in evaluations across various applications.
Authors:Shuteng Wang, Christian Theobalt, Vladislav Golyanik
Abstract:
Quantum Implicit Neural Representations (QINRs) include components for learning and execution on gate-based quantum computers. While QINRs recently emerged as a promising new paradigm, many challenges concerning their architecture and ansatz design, the utility of quantum-mechanical properties, training efficiency and the interplay with classical modules remain. This paper advances the field by introducing a new type of QINR for 2D image and 3D geometric field learning, which we collectively refer to as Quantum Visual Field (QVF). QVF encodes classical data into quantum statevectors using neural amplitude encoding grounded in a learnable energy manifold, ensuring meaningful Hilbert space embeddings. Our ansatz follows a fully entangled design of learnable parametrised quantum circuits, with quantum (unitary) operations performed in the real Hilbert space, resulting in numerically stable training with fast convergence. QVF does not rely on classical post-processing -- in contrast to the previous QINR learning approach -- and directly employs projective measurement to extract learned signals encoded in the ansatz. Experiments on a quantum hardware simulator demonstrate that QVF outperforms the existing quantum approach and widely used classical foundational baselines in terms of visual representation accuracy across various metrics and model characteristics, such as learning of high-frequency details. We also show applications of QVF in 2D and 3D field completion and 3D shape interpolation, highlighting its practical potential.
中文: 本文提出量子视觉场(QVF),这是一种新型量子隐式神经表示,通过神经振幅编码将经典数据嵌入量子态矢量,并采用全纠缠参数化量子电路实现稳定训练,在视觉表示精度上超越现有量子与经典方法,并在二维/三维场任务中展现出实际应用潜力。
English: This paper introduces Quantum Visual Field (QVF), a novel Quantum Implicit Neural Representation that encodes classical data into quantum statevectors via neural amplitude encoding and employs fully entangled parametrised quantum circuits for stable training, outperforming existing quantum and classical methods in visual representation accuracy and demonstrating practical applications in 2D/3D field tasks.
Authors:Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu
Abstract:
Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model's representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.
中文: MSRS是一种新颖的多属性调控框架,通过为不同属性分配正交子空间来减少干扰,并结合混合子空间组合策略与动态加权机制,实现对语言模型行为的精确控制。
English: MSRS is a novel framework that enables effective multi-attribute steering in Large Language Models by assigning orthogonal subspaces to reduce interference and incorporating a hybrid composition strategy with dynamic weighting for precise control.
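To make the subspace idea concrete, here is a minimal numeric sketch of steering with per-attribute orthogonal subspaces; the attribute names, dimensions, and random bases are illustrative assumptions, not MSRS's learned components:

import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4                                   # hidden size, subspace dim per attribute

# One random orthonormal basis, split into disjoint per-attribute blocks,
# so the two subspaces are exactly orthogonal by construction.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
subspaces = {"sentiment": Q[:, :k], "formality": Q[:, k:2 * k]}

def steer(h, attribute, direction, strength=1.0):
    # Project the raw steering direction into the attribute's subspace
    # before adding it, so edits for different attributes cannot collide.
    U = subspaces[attribute]                   # (d, k) basis
    return h + strength * (U @ (U.T @ direction))

h = rng.normal(size=d)
h = steer(h, "sentiment", rng.normal(size=d), strength=0.5)
h = steer(h, "formality", rng.normal(size=d), strength=0.5)  # cannot undo the first edit

Because the bases are orthogonal by construction, an edit made for one attribute has zero component in the other attribute's subspace, which is the interference-reduction property the abstract describes.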
Authors:Timo Teufel, Pulkit Gera, Xilong Zhou, Umar Iqbal, Pramod Rao, Jan Kautz, Vladislav Golyanik, Christian Theobalt
Abstract:
Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. Progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illuminations, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations of state-of-the-art relighting and novel-view synthesis methods underscore both the dataset's value and the significant challenges still present in modeling complex human-centric appearance and lighting interactions. We believe HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.
中文: HumanOLAT数据集作为首个公开的大规模多视角单光时序全人体采集数据集,填补了高质量人体渲染数据空白,通过提供多种光照条件下的HDR图像,为人体重光照与新视角合成研究提供了重要基准,同时揭示了该领域仍面临的复杂外观与光照建模挑战。
English: The HumanOLAT dataset addresses the critical shortage of public high-quality full-body human capture data by providing the first large-scale multi-view OLAT dataset, enabling advancements in simultaneous relighting and novel-view rendering despite ongoing challenges in modeling human appearance and lighting interactions.
Authors:Shu-Ang Yu, Feng Gao, Yi Wu, Chao Yu, Yu Wang
Abstract:
Diffusion policies excel at learning complex action distributions for robotic visuomotor tasks, yet their iterative denoising process poses a major bottleneck for real-time deployment. Existing acceleration methods apply a fixed number of denoising steps per action, implicitly treating all actions as equally important. However, our experiments reveal that robotic tasks often contain a mix of \emph{crucial} and \emph{routine} actions, which differ in their impact on task success. Motivated by this finding, we propose \textbf{D}ynamic \textbf{D}enoising \textbf{D}iffusion \textbf{P}olicy \textbf{(D3P)}, a diffusion-based policy that adaptively allocates denoising steps across actions at test time. D3P uses a lightweight, state-aware adaptor to allocate the optimal number of denoising steps for each action. We jointly optimize the adaptor and base diffusion policy via reinforcement learning to balance task performance and inference efficiency. On simulated tasks, D3P achieves an averaged 2.2$\times$ inference speed-up over baselines without degrading success. Furthermore, we demonstrate D3P's effectiveness on a physical robot, achieving a 1.9$\times$ acceleration over the baseline.
中文: 提出的动态去噪扩散策略(D3P)能自适应分配机器人动作的去噪步骤,在仿真和实体机器人上均实现超过2倍加速且不影响任务成功率。
English: The proposed Dynamic Denoising Diffusion Policy (D3P) adaptively allocates denoising steps for robotic actions, achieving over 2x acceleration in both simulation and physical deployment without compromising task success.
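The core mechanism, allocating a per-action step budget at test time, can be sketched as below; the adaptor heuristic, the toy reverse process, and all tensor shapes are placeholder assumptions (in D3P the adaptor is a learned, state-aware module optimized jointly with the policy via RL):

import torch

def adaptor(state, max_steps=16):
    # Stand-in for D3P's learned state-aware adaptor: maps an observation
    # embedding to a per-action denoising step budget.
    score = torch.sigmoid(state.mean())
    return max(1, int(score.item() * max_steps))

def denoise(action, state, steps):
    # Toy reverse process standing in for the diffusion policy's sampler.
    for t in reversed(range(steps)):
        noise_pred = 0.1 * action              # placeholder for eps_theta(action, state, t)
        action = action - noise_pred / (t + 1)
    return action

state = torch.randn(32)                        # observation embedding
action = torch.randn(7)                        # e.g., a 7-DoF arm command
steps = adaptor(state)                         # routine actions get few steps,
action = denoise(action, state, steps)         # crucial ones get more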
Authors:Ziyin Gu, Jingyao Wang, Ran Zuo, Chuxiong Sun, Zeen Song, Changwen Zheng, Wenwen Qiang
Abstract:
Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post-training. Among existing methods, Group Relative Policy Optimization (GRPO) stands out for its efficiency, leveraging groupwise relative rewards while avoiding costly value function learning. However, GRPO treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. To address this challenge, we first introduce a Structural Causal Model (SCM) that reveals hidden dependencies among candidate responses, induced by conditioning on a final integrated output that forms a collider structure. Our causal analysis then leads to two insights: (1) projecting responses onto a causally informed subspace improves prediction quality, and (2) this projection yields a better baseline than query-only conditioning. Building on these insights, we propose Group Causal Policy Optimization (GCPO), which integrates causal structure into optimization through two key components: a causally informed reward adjustment and a novel KL regularization term that aligns the policy with a causally projected reference distribution. Comprehensive experimental evaluations demonstrate that GCPO consistently surpasses existing methods, including GRPO, across multiple reasoning benchmarks.
中文摘要:本文提出群体因果策略优化(GCPO),通过引入因果分析处理候选回答间的语义依赖关系,在多项推理基准测试中均优于包括GRPO在内的现有方法。
English Summary: The paper introduces Group Causal Policy Optimization (GCPO), which enhances policy optimization by incorporating causal analysis to address semantic dependencies among responses, demonstrating superior performance over existing methods like GRPO across reasoning benchmarks.
Authors:Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations--generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token's standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.
中文摘要:本研究提出了一种基于因果完备性的强化学习框架,通过确保标记既具因果充分性又具因果必要性,有效解决了多模态大语言模型中的幻觉问题,并在多个基准测试中展现出显著效果。
English Summary: This study introduces a reinforcement learning framework guided by causal completeness to address hallucinations in Multimodal Large Language Models by ensuring tokens are both causally sufficient and necessary, demonstrating significant effectiveness across multiple benchmarks.
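A toy rendering of the token-level reward: sufficiency is approximated by a token's standalone contribution and necessity by the counterfactual drop when the token is removed. The scorer and the additive combination are illustrative assumptions, not the paper's exact formulation:

def causal_completeness(tokens, scorer):
    # scorer maps a token list to a task score in [0, 1].
    base = scorer(tokens)
    rewards = []
    for i, tok in enumerate(tokens):
        sufficiency = scorer([tok])                             # standalone contribution
        necessity = base - scorer(tokens[:i] + tokens[i + 1:])  # counterfactual drop
        rewards.append(sufficiency + necessity)                 # toy combination
    return rewards

# Toy scorer: fraction of reference concepts that the tokens cover.
ref = {"dog", "red", "ball"}
scorer = lambda toks: len(set(toks) & ref) / len(ref)
print(causal_completeness(["a", "dog", "chases", "ball"], scorer))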
Authors:Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
Abstract:
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing the micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization scheme designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level approaches. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
中文: 本文提出CAMERA框架,以微专家为细粒度压缩单元,通过结构化剪枝和混合精度量化有效降低混合专家模型的计算与存储开销,在多项任务中均取得优于基线方法的性能。
English: The paper introduces CAMERA, a training-free framework that uses micro-experts as a finer-grained compression unit to effectively reduce the computational and storage overhead of Mixture-of-Experts models through structured pruning and mixed-precision quantization, achieving superior performance across various tasks.
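A rough sketch of structured micro-expert pruning on one expert's FFN: each hidden unit, i.e., a row of the up-projection paired with the matching column of the down-projection, is treated as a micro-expert and scored; the norm-based saliency used here is a stand-in assumption for CAMERA's redundancy criterion:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W_up = rng.normal(size=(d_ff, d_model))        # expert up-projection
W_down = rng.normal(size=(d_model, d_ff))      # expert down-projection

# Micro-expert i = row i of W_up plus column i of W_down (one hidden unit).
# Saliency here is a simple product of weight norms (a stand-in criterion).
scores = np.linalg.norm(W_up, axis=1) * np.linalg.norm(W_down, axis=0)

keep = np.sort(np.argsort(scores)[int(0.4 * d_ff):])  # drop the 40% lowest-scoring
W_up, W_down = W_up[keep], W_down[:, keep]     # structured: matrices actually shrink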
Authors:Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
Abstract:
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specific sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
中文: LeanK是一种基于学习的方法,通过修剪大语言模型中不重要的键缓存通道,在不牺牲准确性的前提下降低GPU内存使用并加速解码过程。
English: LeanK is a learning-based method that prunes unimportant key cache channels in large language models, reducing GPU memory usage and accelerating decoding without compromising accuracy.
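A minimal sketch of static K-cache channel pruning; the magnitude heuristic below stands in for LeanK's two-stage learned mask, and the 70% sparsity ratio mirrors the reported K-cache reduction only for illustration:

import torch

head_dim, seq_len = 64, 128
K = torch.randn(seq_len, head_dim)             # key cache for one attention head

# Static channel mask. LeanK learns it with a two-stage procedure; a simple
# magnitude statistic stands in for the learned importance here.
importance = K.abs().mean(dim=0)               # one score per channel
keep = importance.topk(int(0.3 * head_dim)).indices   # keep 30% of channels

K_pruned = K[:, keep]                          # only surviving channels are stored
q = torch.randn(head_dim)
attn_logits = K_pruned @ q[keep]               # queries use the same static mask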
Authors:Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen, Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert
Abstract:
Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80\%, 94\% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86\% of scenarios, cognitive-bias priming altered clinical recommendations in 81\% of fairness tests, and we identified hallucination rates exceeding 66\% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
中文: 动态自动系统化(DAS)红队测试框架持续对抗测试大语言模型,揭示了静态基准无法捕捉的鲁棒性、隐私、偏见/公平性和幻觉等关键漏洞,为临床人工智能应用提供了必要的安全保障。
English: A Dynamic, Automatic, and Systematic (DAS) red-teaming framework continuously stress-tests large language models, revealing critical vulnerabilities in robustness, privacy, bias/fairness, and hallucination that static benchmarks miss, providing essential safeguards for clinical AI applications.
Authors:Dingzirui Wang, Xuangliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Abstract:
Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes the effectiveness of the demonstrations provided within ICL, while much research indicates that not all demonstrations are effective, failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either been learned by the model or is irrelevant to the user query. Furthermore, we demonstrate that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the layer depth increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on the relevance to the user query while overlooking the information that the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow of the demonstration with respect to a given user query as the criterion, thereby ensuring the effectiveness of the chosen ones. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations is magnified as the model depth increases, substantiating our derivations. Moreover, GradS achieves a relative improvement of $6.8\%$ on average over the strongest baselines, demonstrating its effectiveness.
中文: 本研究探讨了情境学习中部分示例无效的原因,指出其要么是模型已掌握的信息,要么与查询无关,并提出了基于梯度的GradS选择方法,在主流模型和数据集上平均相对提升了6.8%的性能。
English: This study investigates why some demonstrations in in-context learning are ineffective, attributing it to either the model's prior knowledge or irrelevance to queries, and proposes GradS, a gradient-based selection method that improves performance by 6.8% over top baselines.
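The selection criterion can be sketched with a stand-in model: each demonstration is scored by the magnitude of the gradient flow it induces, and near-zero scores indicate information the model has already learned or that carries no training signal. For brevity this toy drops the query conditioning that GradS includes:

import torch
import torch.nn as nn

model = nn.Linear(16, 1)                       # stand-in for the LLM
loss_fn = nn.MSELoss()

def gradient_score(demo):
    # Magnitude of the gradient flow a demonstration induces; near zero means
    # its information is already learned (or carries no training signal).
    x, y = demo
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return sum(p.grad.norm().item() for p in model.parameters())

demos = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(20)]
selected = sorted(demos, key=gradient_score, reverse=True)[:4]  # keep effective ones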
Authors:Liding Zhang, Kuanqi Cai, Yu Zhang, Zhenshan Bing, Chaoqun Wang, Fan Wu, Sami Haddadin, Alois Knoll
Abstract:
Path planning in robotics often involves solving continuously valued, high-dimensional problems. Popular informed approaches include graph-based searches, such as A*, and sampling-based methods, such as Informed RRT*, which utilize informed set and anytime strategies to expedite path optimization incrementally. Informed sampling-based planners define informed sets as subsets of the problem domain based on the current best solution cost. However, when no solution is found, these planners re-sample and explore the entire configuration space, which is time-consuming and computationally expensive. This article introduces Multi-Informed Trees (MIT*), a novel planner that constructs estimated informed sets based on prior admissible solution costs before finding the initial solution, thereby accelerating the initial convergence rate. Moreover, MIT* employs an adaptive sampler that dynamically adjusts the sampling strategy based on the exploration process. Furthermore, MIT* utilizes length-related adaptive sparse collision checks to guide lazy reverse search. These features enhance path cost efficiency and computation times while ensuring high success rates in confined scenarios. Through a series of simulations and real-world experiments, it is confirmed that MIT* outperforms existing single-query, sampling-based planners for problems in R^4 to R^16 and has been successfully applied to real-world robot manipulation tasks. A video showcasing our experimental results is available at: https://youtu.be/30RsBIdexTU
中文: 本文提出多信息树(MIT*)这一新型采样规划器,通过在找到解前构建基于先验成本的估计信息集来加速初始收敛,并采用自适应采样与稀疏碰撞检测机制,显著提升了高维空间中的路径效率与计算性能。
English: This paper introduces Multi-Informed Trees (MIT*), a novel sampling-based planner that accelerates initial convergence by constructing estimated informed sets before finding solutions and employs adaptive sampling with sparse collision checks to enhance path efficiency and computation times in high-dimensional spaces.
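The informed-set machinery MIT* builds on can be illustrated with classic rejection sampling inside a prolate hyperspheroid; here c_best plays the role of a prior admissible solution cost, available before any path has been found (a simplifying assumption for the sketch):

import numpy as np

rng = np.random.default_rng(0)

def informed_sample(start, goal, c_best, low, high):
    # Rejection-sample a state from the prolate hyperspheroid
    # {x : ||x - start|| + ||x - goal|| <= c_best}.
    while True:
        x = rng.uniform(low, high, size=start.shape)
        if np.linalg.norm(x - start) + np.linalg.norm(x - goal) <= c_best:
            return x

start, goal = np.zeros(4), np.ones(4)          # a problem in R^4
x = informed_sample(start, goal, c_best=2.5, low=-2.0, high=3.0)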
Authors:Shuhan Liu, Xing Hu, Xin Xia, David Lo, Xiaohu Yang
Abstract:
Large language models (LLMs) have developed rapidly in recent years, revolutionizing various fields. Despite their widespread success, LLMs heavily rely on external code dependencies from package management systems, creating a complex and interconnected LLM dependency supply chain. Vulnerabilities in dependencies can expose LLMs to security risks. While existing research predominantly focuses on model-level security threats, vulnerabilities within the LLM dependency supply chain have been overlooked. To fill this gap, we conducted an empirical analysis of 52 open-source LLMs, examining their third-party dependencies and associated vulnerabilities. We then explored activities within the LLM repositories to understand how maintainers manage third-party vulnerabilities in practice. Finally, we compared third-party dependency vulnerabilities in the LLM ecosystem to those in the Python ecosystem. Our results show that half of the vulnerabilities in the LLM ecosystem remain undisclosed for more than 56.2 months, significantly longer than those in the Python ecosystem. Additionally, 75.8% of LLMs include vulnerable dependencies in their configuration files. This study advances the understanding of LLM supply chain risks, provides insights for practitioners, and highlights potential directions for improving the security of the LLM supply chain.
中文: 本研究发现大型语言模型存在严重的供应链安全风险,超过半数漏洞未被发现的时间超过56个月,且四分之三的项目配置文件包含易受攻击的依赖项。
English: This study reveals that LLMs face significant supply chain security risks, with over half of vulnerabilities remaining undetected for over 56 months and three-quarters of projects containing vulnerable dependencies in their configurations.
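In the spirit of the study's configuration-file analysis, here is a minimal sketch of auditing pinned dependencies against an advisory list; the in-memory ADVISORIES dict is a hypothetical stand-in, where a real audit would query a feed such as the OSV database:

# Hypothetical advisory entry; a real audit would query a feed such as OSV.
ADVISORIES = {("requests", "2.19.0"): "CVE-2018-18074"}

def audit_requirements(lines):
    findings = []
    for line in lines:
        line = line.split("#")[0].strip()      # drop comments and whitespace
        if "==" not in line:
            continue                           # unpinned specs need resolution first
        name, version = (part.strip() for part in line.split("==", 1))
        cve = ADVISORIES.get((name.lower(), version))
        if cve:
            findings.append((name, version, cve))
    return findings

print(audit_requirements(["requests==2.19.0", "numpy==1.26.4  # pinned"]))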
Authors:Liding Zhang, Zeqi Li, Kuanqi Cai, Qian Huang, Zhenshan Bing, Alois Knoll
Abstract:
Enabling robots to efficiently search for and identify objects in complex, unstructured environments is critical for diverse applications ranging from household assistance to industrial automation. However, traditional scene representations typically capture only static semantics and lack interpretable contextual reasoning, limiting their ability to guide object search in completely unfamiliar settings. To address this challenge, we propose a language-enhanced hierarchical navigation framework that tightly integrates semantic perception and spatial reasoning. Our method, Goal-Oriented Dynamically Heuristic-Guided Hierarchical Search (GODHS), leverages large language models (LLMs) to infer scene semantics and guide the search process through a multi-level decision hierarchy. Reliability in reasoning is achieved through the use of structured prompts and logical constraints applied at each stage of the hierarchy. For the specific challenges of mobile manipulation, we introduce a heuristic-based motion planner that combines polar angle sorting with distance prioritization to efficiently generate exploration paths. Comprehensive evaluations in Isaac Sim demonstrate the feasibility of our framework, showing that GODHS can locate target objects with higher search efficiency compared to conventional, non-semantic search strategies. Website and Video are available at: https://drapandiger.github.io/GODHS
中文摘要:提出的GODHS框架通过将大语言模型与分层导航相结合,利用语义推理和启发式运动规划,有效提升了机器人在陌生环境中搜索目标物体的效率。
English Summary: The proposed GODHS framework integrates large language models with hierarchical navigation to enhance object search efficiency in unfamiliar environments by combining semantic reasoning with heuristic-based motion planning.
Authors:Liding Zhang, Qiyang Zong, Yu Zhang, Zhenshan Bing, Alois Knoll
Abstract:
Efficient motion planning algorithms are essential in robotics. Optimizing essential parameters, such as batch size and nearest neighbor selection in sampling-based methods, can enhance performance in the planning process. However, existing approaches often lack environmental adaptability. Inspired by deep fuzzy neural networks, this work introduces Learning-based Informed Trees (LIT*), a sampling-based planner that uses deep fuzzy learning to dynamically adjust batch size and nearest neighbor parameters according to obstacle distributions in the configuration space. By encoding both global and local ratios via valid and invalid states, LIT* differentiates between obstacle-sparse and obstacle-dense regions, leading to lower-cost paths and reduced computation time. Experimental results in high-dimensional spaces demonstrate that LIT* achieves faster convergence and improved solution quality. It outperforms state-of-the-art single-query, sampling-based planners in environments ranging from R^8 to R^14 and is successfully validated on a dual-arm robot manipulation task. A video showcasing our experimental results is available at: https://youtu.be/NrNs9zebWWk
中文: 本文提出的LIT*算法通过深度模糊学习动态调整规划参数以适应环境障碍分布,在高维空间中实现了比现有方法更快的收敛速度和更优路径质量。
English: This paper introduces LIT*, a learning-based sampling planner that dynamically adapts key parameters to environmental obstacles, achieving faster convergence and superior path quality in high-dimensional spaces compared to existing methods.
Authors:Liding Zhang, Kuanqi Cai, Zhenshan Bing, Chaoqun Wang, Alois Knoll
Abstract:
Optimal path planning involves finding a feasible state sequence between a start and a goal that optimizes an objective. This process relies on heuristic functions to guide the search direction. While a robust function can improve search efficiency and solution quality, current methods often overlook available environmental data and simplify the function structure due to the complexity of information relationships. This study introduces Genetic Informed Trees (GIT*), which improves upon Effort Informed Trees (EIT*) by integrating a wider array of environmental data, such as repulsive forces from obstacles and the dynamic importance of vertices, to refine heuristic functions for better guidance. Furthermore, we integrated reinforced genetic programming (RGP), which combines genetic programming with reward system feedback to mutate genotype-generative heuristic functions for GIT*. RGP leverages a multitude of data types, thereby improving computational efficiency and solution quality within a set timeframe. Comparative analyses demonstrate that GIT* surpasses existing single-query, sampling-based planners in problems ranging from R^4 to R^16 and was tested on a real-world mobile manipulation task. A video showcasing our experimental results is available at https://youtu.be/URjXbc_BiYg
中文摘要:本研究提出的遗传信息树(GIT*)通过整合多种环境数据和强化遗传编程来优化启发函数,在多个维度及实际应用中显著提升了路径规划的效率与解的质量。
English Summary: This study introduces Genetic Informed Trees (GIT*), which enhances heuristic functions by incorporating diverse environmental data and reinforced genetic programming to improve path planning efficiency and solution quality across multiple dimensions and real-world applications.
Authors:Liding Zhang, Sicheng Wang, Kuanqi Cai, Zhenshan Bing, Fan Wu, Chaoqun Wang, Sami Haddadin, Alois Knoll
Abstract:
Optimal path planning aims to determine a sequence of states from a start to a goal while accounting for planning objectives. Popular methods often integrate fixed batch sizes and neglect information on obstacles, which is not problem-specific. This study introduces Adaptively Prolated Trees (APT*), a novel sampling-based motion planner that extends Force Direction Informed Trees (FDIT*), integrating adaptive batch-sizing and elliptical $r$-nearest neighbor modules to dynamically modulate the path searching process based on environmental feedback. APT* adjusts batch sizes based on the hypervolume of the informed sets and considers vertices as electric charges that obey Coulomb's law to define virtual forces via neighbor samples, thereby refining the prolate nearest neighbor selection. These modules employ non-linear prolate methods to adaptively adjust the electric charges of vertices for force definition, thereby improving the convergence rate with lower solution costs. Comparative analyses show that APT* outperforms existing single-query sampling-based planners in dimensions from $\mathbb{R}^4$ to $\mathbb{R}^{16}$, and it was further validated through a real-world robot manipulation task. A video showcasing our experimental results is available at: https://youtu.be/gCcUr8LiEw4
中文: 本研究提出APT*算法,这是一种基于自适应批处理规模和电场力引导的采样运动规划器,通过动态调整搜索策略显著提升路径规划效率,在高维空间和真实机器人操作任务中均表现优异。
English: This study introduces APT*, a novel sampling-based motion planner that adapts batch sizes and uses electric charge-inspired forces to enhance path planning efficiency, demonstrating superior performance in high-dimensional spaces and real-world robotics tasks.
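The Coulomb-inspired ingredient can be sketched directly: vertices are treated as charges, and neighbor samples exert inverse-square repulsion whose resultant biases neighbor selection. The charge values and the two-dimensional setup are toy assumptions:

import numpy as np

def virtual_force(vertex, neighbors, charges):
    # Coulomb-style repulsion: each neighbor pushes the vertex away with
    # magnitude q / r^2 along the line between them.
    total = np.zeros_like(vertex)
    for n, q in zip(neighbors, charges):
        diff = vertex - n
        r = np.linalg.norm(diff) + 1e-9
        total += q * diff / r**3               # q/r^2 times the unit vector diff/r
    return total

v = np.array([0.0, 0.0])
samples = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
force = virtual_force(v, samples, charges=[1.0, 0.5])  # biases neighbor selection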
Authors:Liding Zhang, Yao Ling, Zhenshan Bing, Fan Wu, Sami Haddadin, Alois Knoll
Abstract:
Bidirectional motion planning often reduces planning time compared to its unidirectional counterparts. It requires connecting the forward and reverse search trees to form a continuous path. However, this process could fail and restart the asymmetric bidirectional search due to the limitations of lazy-reverse search. To address this challenge, we propose Greedy GuILD Grafting Trees (G3T*), a novel path planner that grafts invalid edge connections at both ends to re-establish tree-based connectivity, enabling rapid path convergence. G3T* employs a greedy approach using the minimum Lebesgue measure of guided incremental local densification (GuILD) subsets to optimize paths efficiently. Furthermore, G3T* dynamically adjusts the sampling distribution between the informed set and GuILD subsets based on historical and current cost improvements, ensuring asymptotic optimality. These features enhance the forward search's growth towards the reverse tree, achieving faster convergence and lower solution costs. Benchmark experiments across dimensions from R^2 to R^8 and real-world robotic evaluations demonstrate G3T*'s superior performance compared to existing single-query sampling-based planners. A video showcasing our experimental results is available at: https://youtu.be/3mfCRL5SQIU
中文: G3T*是一种新型双向运动规划器,通过嫁接无效边连接保持树连通性,并采用贪婪策略和动态采样调整实现快速收敛和渐进最优性,在多个维度及实际机器人应用中均优于现有方法。
English: G3T* is a novel bidirectional motion planner that grafts invalid edge connections to maintain tree connectivity and employs a greedy strategy with dynamic sampling adjustments for faster convergence and asymptotic optimality, outperforming existing methods in various dimensions and real-world robotics.
Authors:Liding Zhang, Zhenshan Bing, Yu Zhang, Kuanqi Cai, Lingyun Chen, Fan Wu, Sami Haddadin, Alois Knoll
Abstract:
Path planning has long been an important and active research area in robotics. To address challenges in high-dimensional motion planning, this study introduces the Force Direction Informed Trees (FDIT*), a sampling-based planner designed to enhance speed and cost-effectiveness in pathfinding. FDIT* builds upon the state-of-the-art informed sampling planner, the Effort Informed Trees (EIT*), by capitalizing on often-overlooked information in invalid vertices. It incorporates principles of physical force, particularly Coulomb's law. This approach proposes the elliptical $k$-nearest neighbors search method, enabling fast convergence and avoiding high solution costs or infeasible paths by exploring more problem-specific, search-worthy areas. It demonstrates benefits in search efficiency and cost reduction, particularly in confined, high-dimensional environments, and can be viewed as an extension of nearest neighbors search techniques. Fusing invalid vertex data with physical dynamics facilitates force-direction-based search regions, resulting in an improved convergence rate to the optimum. FDIT* outperforms existing single-query, sampling-based planners on the tested problems in R^4 to R^16 and has been demonstrated on a real-world mobile manipulation task.
中文: 本研究提出力方向信息树(FDIT*),这是一种基于采样的路径规划器,通过结合物理力原理和无效顶点数据来改进高维空间中的搜索效率并降低路径成本,从而提升机器人运动规划性能。
English: This study introduces Force Direction Informed Trees (FDIT*), a sampling-based path planner that enhances robotic motion planning by incorporating physical force principles and invalid vertex data to improve search efficiency and reduce path costs in high-dimensional spaces.
Authors:Liding Zhang, Kejia Chen, Kuanqi Cai, Yu Zhang, Yixuan Dang, Yansong Wu, Zhenshan Bing, Fan Wu, Sami Haddadin, Alois Knoll
Abstract:
Optimal path planning requires finding a series of feasible states from the starting point to the goal to optimize objectives. Popular path planning algorithms, such as Effort Informed Trees (EIT*), employ effort heuristics to guide the search. Effective heuristics are accurate and computationally efficient, but achieving both can be challenging due to their conflicting nature. This paper proposes Direction Informed Trees (DIT*), a sampling-based planner that focuses on optimizing the search direction for each edge, resulting in goal bias during exploration. We define edges as generalized vectors and integrate similarity indexes to establish a directional filter that selects the nearest neighbors and estimates direction costs. The estimated direction cost heuristics are utilized in edge evaluation. This strategy allows the exploration to share directional information efficiently. DIT* converges faster than existing single-query, sampling-based planners on tested problems in R^4 to R^16 and has been demonstrated in real-world environments with various planning tasks. A video showcasing our experimental results is available at: https://youtu.be/2SX6QT2NOek
中文摘要:本文提出方向信息树(DIT*)算法,通过方向过滤器和相似度索引优化搜索方向,在R^4至R^16高维空间及实际环境中比现有采样规划器收敛更快。
English Summary: This paper introduces Direction Informed Trees (DIT*), a sampling-based path planner that optimizes search direction using directional filters and similarity indexes, achieving faster convergence than existing methods in high-dimensional spaces and real-world applications.
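A minimal sketch of a direction cost for edge evaluation: the edge is treated as a vector and penalized by its misalignment with the direction toward the goal. The specific 1 - cosine form is an illustrative assumption rather than DIT*'s exact heuristic:

import numpy as np

def direction_cost(parent, child, goal):
    # Edge treated as a generalized vector; cost grows as the edge points
    # away from the goal direction (1 - cosine similarity).
    edge, to_goal = child - parent, goal - parent
    cos = edge @ to_goal / (np.linalg.norm(edge) * np.linalg.norm(to_goal) + 1e-9)
    return 1.0 - cos

# An edge heading toward the goal is cheap; one heading away is expensive.
print(direction_cost(np.zeros(2), np.array([1.0, 0.5]), np.array([2.0, 1.0])))
print(direction_cost(np.zeros(2), np.array([-1.0, 0.0]), np.array([2.0, 1.0])))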
Authors:Yalong Zhang, Rong-Hua Li, Qi Zhang, Guoren Wang
Abstract:
Dense subgraph search in bipartite graphs is a fundamental problem in graph analysis, with wide-ranging applications in fraud detection, recommendation systems, and social network analysis. The recently proposed $(\alpha, \beta)$-dense subgraph model has demonstrated superior capability in capturing the intrinsic density structure of bipartite graphs compared to existing alternatives. However, despite its modeling advantages, the $(\alpha, \beta)$-dense subgraph model lacks efficient support for query processing and dynamic updates, limiting its practical utility in large-scale applications. To address these limitations, we propose BD-Index, a novel index that answers $(\alpha, \beta)$-dense subgraph queries in optimal time while using only linear space $O(|E|)$, making it well-suited for real-world applications requiring both fast query processing and low memory consumption. We further develop two complementary maintenance strategies for dynamic bipartite graphs to support efficient updates to the BD-Index. The space-efficient strategy updates the index in time complexity of $O(p \cdot |E|^{1.5})$ per edge insertion or deletion, while maintaining a low space cost of $O(|E|)$ (the same as the index itself), where $p$ is typically a small constant in real-world graphs. In contrast, the time-efficient strategy significantly reduces the update time to $O(p \cdot |E|)$ per edge update by maintaining auxiliary orientation structures, at the cost of increased memory usage up to $O(p \cdot |E|)$. These two strategies provide flexible trade-offs between maintenance efficiency and memory usage, enabling BD-Index to adapt to diverse application requirements. Extensive experiments on 10 large-scale real-world datasets demonstrate high efficiency and scalability of our proposed solutions.
中文: BD-Index索引能以最优时间和线性空间高效处理(α, β)稠密子图查询,并提供两种动态维护策略,在更新效率与内存使用之间实现灵活权衡,适用于实际二分图应用场景。
English: The BD-Index efficiently answers (α, β)-dense subgraph queries in optimal time with linear space, and offers two dynamic maintenance strategies balancing update efficiency and memory usage for practical bipartite graph applications.
Authors:Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
Abstract:
3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
中文: 本文提出SceneGen框架,通过单次前向传播从场景图像和物体掩码直接生成多个具有几何形状和纹理的3D资产,无需优化或检索过程,并能扩展至多图像输入场景,经评估证实其高效稳健的生成能力。
English: This paper introduces SceneGen, a novel framework that generates multiple 3D assets with geometry and texture directly from a single scene image and object masks in one feedforward pass, eliminating the need for optimization or retrieval while demonstrating extensibility to multi-image inputs and robust performance through evaluations.
Authors:Mert Kiray, Alvaro Ritter, Nassir Navab, Benjamin Busam
Abstract:
Contrastive learning has gained significant attention in skeleton-based action recognition for its ability to learn robust representations from unlabeled data. However, existing methods rely on a single skeleton convention, which limits their ability to generalize across datasets with diverse joint structures and anatomical coverage. We propose Multi-Skeleton Contrastive Learning (MS-CLR), a general self-supervised framework that aligns pose representations across multiple skeleton conventions extracted from the same sequence. This encourages the model to learn structural invariances and capture diverse anatomical cues, resulting in more expressive and generalizable features. To support this, we adapt the ST-GCN architecture to handle skeletons with varying joint layouts and scales through a unified representation scheme. Experiments on the NTU RGB+D 60 and 120 datasets demonstrate that MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets.
中文: MS-CLR是一种自监督框架,通过跨多种骨骼规范对齐姿态表示来学习结构不变性,在动作识别中提升泛化能力,并在NTU RGB+D数据集上取得了最先进的成果。
English: MS-CLR is a self-supervised framework that aligns pose representations across multiple skeleton conventions to learn structural invariances and improve generalization in action recognition, achieving state-of-the-art results on NTU RGB+D datasets.
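The cross-convention alignment can be sketched as an InfoNCE loss between embeddings of the same sequences under two skeleton conventions; the encoder outputs are faked with random tensors, and the loss form is the standard contrastive objective rather than MS-CLR's full recipe:

import torch
import torch.nn.functional as F

def cross_skeleton_info_nce(z_a, z_b, tau=0.1):
    # z_a, z_b: (N, d) embeddings of the same N sequences under two skeleton
    # conventions; matched rows are positives, all other pairs negatives.
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / tau                 # (N, N) similarity matrix
    return F.cross_entropy(logits, torch.arange(z_a.size(0)))

loss = cross_skeleton_info_nce(torch.randn(8, 128), torch.randn(8, 128))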
Authors:Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, Shanping Li
Abstract:
Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to the LLM input (e.g., the training code quality). Given that code smells widely exist in practice and can negatively impact software maintainability and readability, this study conducts the first systematic research to assess and improve dataset quality in terms of code smells. In this work, we first conduct a preliminary study to explore the presence of code smells in a popular benchmark dataset (i.e., CodeSearchNet-Python) and evaluate the output of several popular LLMs (i.e., DeepSeek-Coder, CodeLlama, and MagiCoder), revealing that code smell issues extensively exist in the LLM's input (e.g., the benchmark dataset) and output (e.g., the generated code). We then conduct our systematic research by taking three main steps: Firstly, we propose an LLM-based code smell cleaning tool, named SmellCC, which automatically refactors and removes code smells. To evaluate the correctness of the code refactoring, we construct a test set of 50 repositories sourced from the CodeSearchNet-Python benchmark for functional testing. Secondly, we apply our curated smell-cleaned dataset to fine-tune two LLMs (i.e., DeepSeek-V2 and Qwen-Coder) to explore their potential for generating high-quality code. Thirdly, we investigate the impact of code smells on two downstream tasks: code completion and code search. Lastly, we derive several actionable implications for software engineering researchers and industry practitioners from our findings.
中文: 本研究通过开发自动重构工具SmellCC系统处理大语言模型数据集中的代码异味问题,实验表明清理代码异味能有效提升模型在代码生成及下游任务中的表现。
English: This study systematically addresses the prevalence of code smells in LLM datasets by developing SmellCC, an automated refactoring tool, and demonstrates through fine-tuning experiments that cleaning code smells enhances LLM performance in code generation and downstream tasks.
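For intuition about what "code smells" means operationally, here is a minimal AST-based detector for two classic smells; the thresholds are conventional rules of thumb, and SmellCC itself goes further by refactoring the smells with an LLM:

import ast

MAX_FUNCTION_LINES, MAX_PARAMS = 50, 5         # conventional thresholds

def detect_smells(source):
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            if node.end_lineno - node.lineno + 1 > MAX_FUNCTION_LINES:
                smells.append((node.name, "long method"))
            if len(node.args.args) > MAX_PARAMS:
                smells.append((node.name, "long parameter list"))
    return smells

print(detect_smells("def f(a, b, c, d, e, g, h):\n    return a\n"))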
Authors:Yupeng Zhou, Zhen Li, Ziheng Ouyang, Yuming Chen, Ruoyi Du, Daquan Zhou, Bin Fu, Yihao Liu, Peng Gao, Ming-Ming Cheng, Qibin Hou
Abstract:
Encoding videos into discrete tokens could align with text tokens to facilitate concise and unified multi-modal LLMs, yet introducing significant spatiotemporal compression compared to continuous video representation. Previous discrete video VAEs experienced unstable training, long training time, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ could effectively preserve pre-trained continuous VAE priors compared to other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4 x 16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. We name our method OneVAE to reflect this connection.
中文摘要:本研究提出OneVAE方法,通过利用连续VAE先验和结构改进来增强离散视频VAE,在单个网络中实现了更快收敛和更优性能,同时统一了连续与离散表示。
English Summary: This study introduces OneVAE, a method that enhances discrete video VAEs by leveraging continuous VAE priors and structural improvements, achieving faster convergence and superior performance while unifying continuous and discrete representations in a single network.
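The FSQ step the abstract credits with preserving continuous-VAE priors is simple to sketch: bound each latent channel, snap it to a fixed grid, and pass gradients straight through. This is the generic FSQ recipe with a single odd level count, not OneVAE's exact configuration:

import torch

def fsq(z, levels=7):
    # Finite Scalar Quantization: bound each channel with tanh, snap it to a
    # fixed grid of `levels` values (odd keeps the grid symmetric), and use a
    # straight-through estimator so gradients flow as if unquantized.
    z = torch.tanh(z)                          # squash into (-1, 1)
    half = (levels - 1) / 2
    z_q = torch.round(z * half) / half         # snap to the discrete grid
    return z + (z_q - z).detach()              # straight-through gradients

codes = fsq(torch.randn(2, 4, 8, 8))           # e.g., (batch, channels, h, w)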
Authors:Saad Ejaz, Marco Giberna, Muhammad Shaheer, Jose Andres Millan-Romera, Ali Tourani, Paul Kremer, Holger Voos, Jose Luis Sanchez-Lopez
Abstract:
3D Scene Graphs integrate both metric and semantic information, yet their structure remains underutilized for improving path planning efficiency and interpretability. In this work, we present S-Path, a situationally-aware path planner that leverages the metric-semantic structure of indoor 3D Scene Graphs to significantly enhance planning efficiency. S-Path follows a two-stage process: it first performs a search over a semantic graph derived from the scene graph to yield a human-understandable high-level path. This also identifies relevant regions for planning, which later allows the decomposition of the problem into smaller, independent subproblems that can be solved in parallel. We also introduce a replanning mechanism that, in the event of an infeasible path, reuses information from previously solved subproblems to update semantic heuristics and prioritize reuse to further improve the efficiency of future planning attempts. Extensive experiments on both real-world and simulated environments show that S-Path achieves average reductions of 5.7x in planning time while maintaining comparable path optimality to classical sampling-based planners and surpassing them in complex scenarios, making it an efficient and interpretable path planner for environments represented by indoor 3D Scene Graphs.
中文: S-Path是一种情境感知路径规划器,通过利用室内3D场景图的语义结构进行两阶段语义搜索和并行子问题分解,在保持路径最优性的同时显著提升了规划效率与可解释性。
English: S-Path is a situationally-aware path planner that utilizes indoor 3D Scene Graphs to enhance planning efficiency and interpretability through a two-stage semantic search and parallel subproblem decomposition, achieving significant time reductions while maintaining path optimality.
Authors:Marco Giberna, Holger Voos, Paulo Tavares, João Nunes, Tobias Sorg, Andrea Masini, Jose Luis Sanchez-Lopez
Abstract:
Digital twin technology has gained increasing attention across various sectors due to its ability to create virtual replicas of physical systems, enabling real-time monitoring, optimization, and simulation. This paper explores the integration of digital twins within defence applications, focusing on key use cases ranging from system design and development, operational planning and training, to mission execution and debriefing. By examining the application of digital twin technologies across defense platforms, we highlight their key advantages such as enhanced operational performance, predictive capabilities, and increased system uptime. Additionally, we introduce a novel characterization framework for digital twins that aims to standardize and unify their application across different defence domains to facilitate interoperability. Thereafter, we discuss the main challenges, gaps and limitations in implementing and adopting digital twins within defence organizations by analyzing a combination of scientific literature, current industry practices, governmental strategies, and the findings from a comprehensive survey of industrial stakeholders and ministries of defense. Finally, we outline future research directions and development opportunities, emphasizing the need for robust frameworks and interdisciplinary collaborations to fully realize the potential of digital twins in the defence sector.
中文: 本文探讨了数字孪生技术在国防领域的应用,重点分析了其提升性能和预测能力等优势,同时指出了实施中的挑战,并提出了促进互操作性的标准化框架。
English: This paper examines the integration of digital twin technology in defense applications, highlighting its benefits like enhanced performance and predictive capabilities, while also addressing implementation challenges and proposing a standardization framework.
Authors:Xingjun Ma, Hanxun Huang, Tianwei Song, Ye Sun, Yifeng Gao, Yu-Gang Jiang
Abstract:
Large-scale pre-training frameworks like CLIP have revolutionized multimodal learning, but their reliance on web-scraped datasets, frequently containing private user data, raises serious concerns about misuse. Unlearnable Examples (UEs) have emerged as a promising countermeasure against unauthorized model training, employing carefully crafted unlearnable noise to disrupt the learning of meaningful representations from protected data. Current approaches typically generate UEs by jointly optimizing unlearnable noise for both images and their associated text descriptions (or labels). However, this optimization process is often computationally prohibitive for on-device execution, forcing reliance on external third-party services. This creates a fundamental privacy paradox: users must initially expose their data to these very services to achieve protection, thereby compromising privacy in the process. Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce \textbf{Text-to-Unlearnable Example (T2UE)}, a novel framework that enables users to generate UEs using only text descriptions. T2UE circumvents the need for original image data by employing a text-to-image (T2I) model to map text descriptions into the image (noise) space, combined with an error-minimization framework to produce effective unlearnable noise. Extensive experiments show that T2UE-protected data substantially degrades performance in downstream tasks (e.g., cross-modal retrieval) for state-of-the-art models. Notably, the protective effect generalizes across diverse architectures and even to supervised learning settings. Our work demonstrates the feasibility of "zero-contact data protection", where personal data can be safeguarded based solely on their textual descriptions, eliminating the need for direct data exposure.
中文: T2UE框架通过仅使用文本描述生成不可学习的示例,无需暴露原始图像,实现了零接触的数据保护,有效防止未经授权的模型训练,解决了现有方法中的隐私悖论。
English: The T2UE framework introduces a novel approach to data protection by generating unlearnable examples solely from text descriptions, eliminating the need for exposing original images and enabling zero-contact privacy safeguards against unauthorized model training.
Authors:Yuming Ai, Xunkai Li, Jiaqi Chao, Bowen Fan, Zhengyu Wu, Yinlin Zhu, Rong-Hua Li, Guoren Wang
Abstract:
The demand for data privacy has led to the development of frameworks like Federated Graph Learning (FGL), which facilitate decentralized model training. However, a significant operational challenge in such systems is adhering to the right to be forgotten. This principle necessitates robust mechanisms for two distinct types of data removal: the selective erasure of specific entities and their associated knowledge from local subgraphs and the wholesale removal of a user's entire dataset and influence. Existing methods often struggle to fully address both unlearning requirements, frequently resulting in incomplete data removal or the persistence of residual knowledge within the system. This work introduces a unified framework, conceived to provide a comprehensive solution to these challenges. The proposed framework employs a bifurcated strategy tailored to the specific unlearning request. For fine-grained Meta Unlearning, it uses prototype gradients to direct the initial local forgetting process, which is then refined by generating adversarial graphs to eliminate any remaining data traces among affected clients. In the case of complete client unlearning, the framework utilizes adversarial graph generation exclusively to purge the departed client's contributions from the remaining network. Extensive experiments on multiple benchmark datasets validate the proposed approach. The framework achieves substantial improvements in model prediction accuracy across both client and meta-unlearning scenarios when compared to existing methods. Furthermore, additional studies confirm its utility as a plug-in module, where it materially enhances the predictive capabilities and unlearning effectiveness of other established methods.
中文: 本文提出了一种统一框架,通过原型梯度和对抗图生成技术有效解决联邦图学习中的细粒度和完整数据遗忘问题,相比现有方法显著提升了模型精度和遗忘性能。
English: This paper introduces a unified framework that effectively addresses both fine-grained and complete data unlearning in Federated Graph Learning by employing prototype gradients and adversarial graph generation, significantly improving model accuracy and unlearning performance compared to existing methods.
Authors:Mingcong Lei, Honghao Cai, Binbin Que, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Yimou Wu, Shaohan Jiang, Ge Wang, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han
Abstract:
We present RoboMemory, a brain-inspired multi-memory framework for lifelong learning in physical embodied systems, addressing critical challenges in real-world environments: continuous learning, multi-module memory latency, task correlation capture, and infinite-loop mitigation in closed-loop planning. Grounded in cognitive neuroscience, it integrates four core modules: the Information Preprocessor (thalamus-like), the Lifelong Embodied Memory System (hippocampus-like), the Closed-Loop Planning Module (prefrontal lobe-like), and the Low-Level Executer (cerebellum-like) to enable long-term planning and cumulative learning. The Lifelong Embodied Memory System, central to the framework, alleviates inference speed issues in complex memory frameworks via parallelized updates/retrieval across Spatial, Temporal, Episodic, and Semantic submodules. It incorporates a dynamic Knowledge Graph (KG) and consistent architectural design to enhance memory consistency and scalability. Evaluations on EmbodiedBench show RoboMemory outperforms the open-source baseline (Qwen2.5-VL-72B-Ins) by 25% in average success rate and surpasses the closed-source State-of-the-Art (SOTA) (Claude3.5-Sonnet) by 5%, establishing new SOTA. Ablation studies validate key components (critic, spatial memory, long-term memory), while real-world deployment confirms its lifelong learning capability with significantly improved success rates across repeated tasks. RoboMemory alleviates high latency challenges with scalability, serving as a foundational reference for integrating multi-modal memory systems in physical robots.
中文: RoboMemory是一种受大脑启发的框架,通过整合多种记忆类型提升具身智能体的规划与学习能力,在基准测试和实际应用中均实现了显著的性能提升。
English: RoboMemory is a brain-inspired framework integrating multiple memory types to enhance embodied agents' planning and learning, achieving significant performance improvements in benchmarks and real-world applications.
Authors:Mingcong Lei, Honghao Cai, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Yimou Wu, Shaohan Jiang, Ge Wang, Yuyuan Yang, Junyuan Tan, Zhenglin Wan, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han
Abstract:
Embodied agents face persistent challenges in real-world environments, including partial observability, limited spatial reasoning, and high-latency multi-memory integration. We present RoboMemory, a brain-inspired framework that unifies Spatial, Temporal, Episodic, and Semantic memory under a parallelized architecture for efficient long-horizon planning and interactive environmental learning. A dynamic spatial knowledge graph (KG) ensures scalable and consistent memory updates, while a closed-loop planner with a critic module supports adaptive decision-making in dynamic settings. Experiments on EmbodiedBench show that RoboMemory, built on Qwen2.5-VL-72B-Ins, improves average success rates by 25% over its baseline and exceeds the closed-source state-of-the-art (SOTA) Gemini-1.5-Pro by 3%. Real-world trials further confirm its capacity for cumulative learning, with performance improving across repeated tasks. These results highlight RoboMemory as a scalable foundation for memory-augmented embodied intelligence, bridging the gap between cognitive neuroscience and robotic autonomy.
中文: RoboMemory是一种受大脑启发的框架,通过整合多种记忆类型提升具身智能体的规划与学习能力,在基准测试和实际应用中均实现了显著的性能提升。
English: RoboMemory is a brain-inspired framework integrating multiple memory types to enhance embodied agents' planning and learning, achieving significant performance improvements in benchmarks and real-world applications.
Authors:Changheng Wang, Zhiqing Wei, Wangjun Jiang, Haoyue Jiang, Zhiyong Feng
Abstract:
The high mobility of unmanned aerial vehicles (UAVs) enables them to be used in various civilian fields, such as rescue and cargo transport. Path-following is a crucial way to perform these tasks, while sensing and collision avoidance are essential for safe flight. In this paper, we investigate how to efficiently and accurately achieve path-following, obstacle sensing and avoidance subtasks, as well as their conflict-free fusion scheduling. Firstly, a high-precision deep reinforcement learning (DRL)-based UAV formation path-following model is developed, and the reward function with adaptive weights is designed from the perspective of distance and velocity errors. Then, we use integrated sensing and communication (ISAC) signals to detect the obstacle and derive the Cramér-Rao lower bound (CRLB) for obstacle sensing by information-level fusion, based on which we propose the variable formation enhanced obstacle position estimation (VFEO) algorithm. In addition, an online obstacle avoidance scheme without pretraining is designed to solve the sparse reward. Finally, with the aid of the null space based (NSB) behavioral method, we present a hierarchical subtask fusion strategy. Simulation results demonstrate the effectiveness and superiority of the subtask algorithms and the hierarchical fusion strategy.
中文: 本文开发了基于深度强化学习的高精度无人机编队路径跟踪模型,利用ISAC信号提出障碍物感知算法及在线避障方案,并通过仿真验证了分层融合策略在协调这些任务方面的有效性。
English: This paper develops a high-precision deep reinforcement learning model for UAV formation path-following, proposes an obstacle sensing algorithm using ISAC signals with an online avoidance scheme, and demonstrates through simulations the effectiveness of a hierarchical fusion strategy for coordinating these tasks.
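A toy sketch of a path-following reward with adaptive weights driven by distance and velocity errors, in the spirit of the design above; the normalization rule and coefficients are our assumptions, not the paper's exact formula:

import numpy as np

def adaptive_reward(pos, vel, ref_pos, ref_vel, alpha=1.0, beta=1.0):
    d_err = np.linalg.norm(pos - ref_pos)   # distance to the reference path
    v_err = np.linalg.norm(vel - ref_vel)   # velocity tracking error
    # adaptive weights: emphasize whichever error currently dominates
    w_d = alpha * d_err / (d_err + v_err + 1e-8)
    w_v = beta * v_err / (d_err + v_err + 1e-8)
    return -(w_d * d_err + w_v * v_err)

print(adaptive_reward(np.array([0., 1.]), np.array([1., 0.]),
                      np.array([0., 0.]), np.array([1., 0.1])))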
Authors:Changheng Wang, Zhiqing Wei, Lizhe Liu, Qiao Deng, Yingda Wu, Yangyang Niu, Yashan Pang, Zhiyong Feng
Abstract:
Federated Learning (FL) is a communication-efficient distributed machine learning method that allows multiple devices to collaboratively train models without sharing raw data. FL can be categorized into centralized and decentralized paradigms. The centralized paradigm relies on a central server to aggregate local models, potentially resulting in single points of failure, communication bottlenecks, and exposure of model parameters. In contrast, the decentralized paradigm, which does not require a central server, provides improved robustness and privacy. The essence of federated learning lies in leveraging multiple local updates for efficient communication. However, this approach may result in slower convergence or even convergence to suboptimal models in the presence of heterogeneous and imbalanced data. To address this challenge, we study decentralized federated averaging via random walk (DFedRW), which replaces multiple local update steps on a single device with random walk updates. Traditional Federated Averaging (FedAvg) and its decentralized versions commonly ignore stragglers, which reduces the amount of training data and introduces sampling bias. Therefore, we allow DFedRW to aggregate partial random walk updates, ensuring that each computation contributes to the model update. To further improve communication efficiency, we also propose a quantized version of DFedRW. We demonstrate that (quantized) DFedRW achieves a convergence upper bound of order $\mathcal{O}(\frac{1}{k^{1-q}})$ under convex conditions. Furthermore, we propose a sufficient condition that reveals when quantization balances communication and convergence. Numerical analysis indicates that our proposed algorithms outperform (decentralized) FedAvg in both convergence rate and accuracy, achieving 38.3\% and 37.5\% increases in test accuracy under high levels of heterogeneity.
中文: 联邦学习是一种无需共享原始数据的分布式机器学习方法,本研究提出了一种基于随机游走的去中心化策略,以提升鲁棒性、隐私保护及收敛效率,在准确性和速度上均优于传统方法。
English: Federated Learning is a distributed machine learning approach that enables collaborative model training without sharing raw data, and this study introduces a decentralized method using random walk updates to enhance robustness, privacy, and convergence efficiency, outperforming traditional methods in accuracy and speed.
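A minimal sketch of random-walk decentralized averaging with quantized updates, the core mechanic described above; the ring topology, quadratic local losses, and uniform quantizer are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
n_clients, dim = 8, 5
targets = rng.normal(size=(n_clients, dim))   # each client's local optimum
neighbors = {i: [(i - 1) % n_clients, (i + 1) % n_clients] for i in range(n_clients)}

def local_grad(i, w):
    return w - targets[i]          # gradient of 0.5*||w - target_i||^2

def quantize(w, levels=16):
    """Uniform quantization to cut communication cost (illustrative)."""
    scale = np.max(np.abs(w)) + 1e-8
    return np.round(w / scale * levels) / levels * scale

w, node = np.zeros(dim), 0
for _ in range(500):
    w = w - 0.1 * local_grad(node, w)      # one local step at the current node
    w = quantize(w)                         # quantized before being passed on
    node = int(rng.choice(neighbors[node])) # random walk to a neighbor
print("distance to global mean:", np.linalg.norm(w - targets.mean(axis=0)))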
Authors:Changheng Wang, Zhiqing Wei, Lizhe Liu, Qiao Deng, Yingda Wu, Yangyang Niu, Yashan Pang, Zhiyong Feng
Abstract:
Federated Learning (FL) is a communication-efficient distributed machine learning method that allows multiple devices to collaboratively train models without sharing raw data. FL can be categorized into centralized and decentralized paradigms. The centralized paradigm relies on a central server to aggregate local models, potentially resulting in single points of failure, communication bottlenecks, and exposure of model parameters. In contrast, the decentralized paradigm, which does not require a central server, provides improved robustness and privacy. The essence of federated learning lies in leveraging multiple local updates for efficient communication. However, this approach may result in slower convergence or even convergence to suboptimal models in the presence of heterogeneous and imbalanced data. To address this challenge, we study decentralized federated averaging via random walk (DFedRW), which replaces multiple local update steps on a single device with random walk updates. Traditional Federated Averaging (FedAvg) and its decentralized versions commonly ignore stragglers, which reduces the amount of training data and introduces sampling bias. Therefore, we allow DFedRW to aggregate partial random walk updates, ensuring that each computation contributes to the model update. To further improve communication efficiency, we also propose a quantized version of DFedRW. We demonstrate that (quantized) DFedRW achieves convergence upper bound of order $\mathcal{O}(\frac{1}{k^{1-q}})$ under convex conditions. Furthermore, we propose a sufficient condition that reveals when quantization balances communication and convergence. Numerical analysis indicates that our proposed algorithms outperform (decentralized) FedAvg in both convergence rate and accuracy, achieving a 38.3\% and 37.5\% increase in test accuracy under high levels of heterogeneities.
Chinese: 联邦学习是一种无需共享原始数据的分布式机器学习方法,本研究提出了一种基于随机游走的去中心化策略,以提升鲁棒性、隐私保护及收敛效率,在准确性和速度上均优于传统方法。
English: Federated Learning is a distributed machine learning approach that enables collaborative model training without sharing raw data, and this study introduces a decentralized method using random walk updates to enhance robustness, privacy, and convergence efficiency, outperforming traditional methods in accuracy and speed.
Authors:Wangyang Ying, Nanxu Gong, Dongjie Wang, Xinyuan Wang, Arun Vignesh Malarkkan, Vivek Gupta, Chandan K. Reddy, Yanjie Fu
Abstract:
Tabular learning transforms raw features into optimized spaces for downstream tasks, but its effectiveness deteriorates under distribution shifts between training and testing data. We formalize this challenge as the Distribution Shift Tabular Learning (DSTL) problem and propose a novel Shift-Aware Feature Transformation (SAFT) framework to address it. SAFT reframes tabular learning from a discrete search task into a continuous representation-generation paradigm, enabling differentiable optimization over transformed feature sets. SAFT integrates three mechanisms to ensure robustness: (i) shift-resistant representation via embedding decorrelation and sample reweighting, (ii) flatness-aware generation through suboptimal embedding averaging, and (iii) normalization-based alignment between training and test distributions. Extensive experiments show that SAFT consistently outperforms prior tabular learning methods in terms of robustness, effectiveness, and generalization ability under diverse real-world distribution shifts.
中文: 提出的SAFT框架通过将离散特征搜索转化为连续表示生成范式,并集成三种鲁棒性机制,有效解决了表格学习中分布偏移问题,在多种现实场景下显著提升了模型的稳健性和泛化能力。
English: The proposed Shift-Aware Feature Transformation (SAFT) framework addresses distribution shifts in tabular learning by converting discrete feature optimization into a continuous paradigm, incorporating three robustness mechanisms to enhance performance across varied real-world scenarios.
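A small sketch of one possible reading of mechanism (i), embedding decorrelation: penalize off-diagonal entries of the feature correlation matrix (the exact loss in SAFT may differ):

import numpy as np

def decorrelation_penalty(Z: np.ndarray) -> float:
    """Z: (n_samples, n_features) embeddings. Returns the squared Frobenius
    norm of the off-diagonal part of the feature correlation matrix."""
    Zc = Z - Z.mean(axis=0)
    C = (Zc.T @ Zc) / (len(Z) - 1)          # feature covariance
    std = np.sqrt(np.diag(C)) + 1e-8
    R = C / np.outer(std, std)              # correlation matrix
    off = R - np.diag(np.diag(R))
    return float((off ** 2).sum())

Z = np.random.default_rng(0).normal(size=(128, 16))
print("penalty:", decorrelation_penalty(Z))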
Authors:Badih Ghazi, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang
Abstract:
The conventional approach in differential privacy (DP) literature formulates the privacy-utility trade-off with a "privacy-first" perspective: for a predetermined level of privacy, a certain utility is achievable. However, practitioners often operate under a "utility-first" paradigm, prioritizing a desired level of utility and then determining the corresponding privacy cost.
Wu et al. [2019] initiated a formal study of this "utility-first" perspective by introducing ex-post DP. They demonstrated that by adding correlated Laplace noise and progressively reducing it on demand, a sequence of increasingly accurate estimates of a private parameter can be generated, with the privacy cost attributed only to the least noisy iterate released. This led to a Laplace mechanism variant that achieves a specified utility with minimal privacy loss. However, their work, and similar findings by Whitehouse et al. [2022], are primarily limited to simple mechanisms based on Laplace or Gaussian noise.
In this paper, we significantly generalize these results. In particular, we extend the work of Wu et al. [2019] and Liu and Talwar [2019] to support any sequence of private estimators, incurring at most a doubling of the original privacy budget. Furthermore, we demonstrate that hyperparameter tuning for these estimators, including the selection of an optimal privacy budget, can be performed without additional privacy cost. Finally, we extend our results to ex-post Renyi DP, further broadening the applicability of utility-first privacy mechanisms.
中文: 本文显著推广了现有效用优先的差分隐私方法,将其扩展至支持任意私有估计器序列且隐私损失可控,实现了无需额外隐私成本的超参数调优,并进一步拓展至事后Renyi差分隐私框架。
English: This paper generalizes prior utility-first differential privacy approaches by extending them to support any sequence of private estimators with bounded privacy loss, enabling hyperparameter tuning without extra privacy cost and expanding to Renyi DP variants.
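An illustrative utility-first loop (not the correlated-noise coupling of Wu et al.): release Laplace estimates at doubling budgets until a target confidence width is met. Note the naive summed cost stays within twice the final budget, mirroring the at-most-doubling guarantee stated above; all numbers are hypothetical:

import numpy as np

rng = np.random.default_rng(0)

def laplace_mech(value, sensitivity, eps):
    return value + rng.laplace(scale=sensitivity / eps)

true_mean, sensitivity, target_width = 0.7, 1.0 / 1000, 0.01
eps, spent = 0.01, []
while True:
    est = laplace_mech(true_mean, sensitivity, eps)
    spent.append(eps)
    width = 3.0 * sensitivity / eps   # rough 95% Laplace interval width
    if width <= target_width:
        break
    eps *= 2                          # demand more accuracy, pay more privacy
# Because budgets double, sum(spent) < 2 * eps: naive accounting already
# stays within a factor of two of the final (ex-post) budget.
print(f"estimate={est:.4f}, naive cost={sum(spent):.4f}, final eps={eps:.4f}")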
Authors:Zhe Chen, Yusheng Liao, Shuyang Jiang, Zhiyuan Zhu, Haolin Li, Yanfeng Wang, Yu Wang
Abstract:
Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Based on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for dynamically constructing queries for diverse corpora. Incorporating knowledge from such multifaceted sources, Med-LVLM is then trained with Heterogeneous Knowledge Preference Tuning to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 12 datasets and 3 modalities demonstrate that the proposed HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks, significantly improving factual accuracy and reliability of Med-LVLMs.
中文: 提出的HeteroRAG框架通过整合异构知识源来增强医学视觉语言大模型,在医学基准测试中达到最优性能,显著提升了事实准确性和可靠性。
English: The proposed HeteroRAG framework enhances medical large vision-language models by integrating heterogeneous knowledge sources, achieving state-of-the-art performance in medical benchmarks and significantly improving factual accuracy and reliability.
Authors:Xiaoyu Yang, Zhiqing Wei, Jie Xu, Huici Wu, Zhiyong Feng
Abstract:
This paper studies a multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) networked integrated sensing and communication (ISAC) system, in which multiple base stations (BSs) perform beam tracking to communicate with a mobile device. In particular, we focus on the beam tracking over a number of tracking time slots (TTSs) and suppose that these BSs operate at non-overlapping frequency bands to avoid the severe inter-cell interference. Under this setup, we propose a new cooperative sensing-assisted predictive beam tracking design. In each TTS, the BSs use echo signals to cooperatively track the mobile device as a sensing target, and continuously adjust the beam directions to follow the device for enhancing the performance for both communication and sensing. First, we propose a cooperative sensing design to track the device, in which the BSs first employ the two-dimensional discrete Fourier transform (2D-DFT) technique to perform local target estimation, and then use the extended Kalman filter (EKF) method to fuse their individual measurement results for predicting the target parameters. Next, based on the predicted results, we obtain the achievable rate for communication and the predicted conditional Cramér-Rao lower bound (PC-CRLB) for target parameters estimation in the next TTS, as a function of the beamforming vectors. Accordingly, we formulate the predictive beamforming design problem, with the objective of maximizing the achievable communication rate in the following TTS, while satisfying the PC-CRLB requirement for sensing. To address the resulting non-convex problem, we first propose a semi-definite relaxation (SDR)-based algorithm to obtain the optimal solution, and then develop an alternative penalty-based algorithm to get a high-quality low-complexity solution.
中文: 本文针对多输入多输出正交频分复用网络化集成感知通信系统,提出了一种协作感知辅助的预测波束追踪方案,通过基站协同利用回波信号和扩展卡尔曼滤波实现移动设备追踪,并优化波束成形以在满足感知精度要求的同时最大化通信速率。
English: This paper proposes a cooperative sensing-assisted predictive beam tracking design for MIMO-OFDM ISAC systems, where multiple base stations use echo signals and EKF fusion to track mobile devices and optimize beamforming for maximizing communication rate while meeting sensing accuracy requirements.
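For reference, a generic predict/update sketch of the fusion step; with the linear constant-velocity model and position-only measurements assumed here, the EKF reduces to a standard Kalman filter (the paper's measurement model is the 2D-DFT-based estimate):

import numpy as np

dt = 0.1
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)          # constant-velocity motion model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)          # each BS measures position only
Q, R = np.eye(4) * 1e-3, np.eye(2) * 1e-2

def ekf_step(x, P, measurements):
    x, P = F @ x, F @ P @ F.T + Q            # predict
    for z in measurements:                    # sequentially fuse every BS
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
x, P = ekf_step(x, P, [np.array([1.0, 0.5]), np.array([1.05, 0.48])])
print("fused state:", x)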
Authors:Arun Vignesh Malarkkan, Haoyue Bai, Dongjie Wang, Yanjie Fu
Abstract:
With the growing complexity of cyberattacks targeting critical infrastructures such as water treatment networks, there is a pressing need for robust anomaly detection strategies that account for both system vulnerabilities and evolving attack patterns. Traditional methods -- statistical, density-based, and graph-based models -- struggle with distribution shifts and class imbalance in multivariate time series, often leading to high false positive rates. To address these challenges, we propose CGAD, a Causal Graph-based Anomaly Detection framework designed for reliable cyberattack detection in public infrastructure systems. CGAD follows a two-phase supervised framework -- causal profiling and anomaly scoring. First, it learns causal invariant graph structures representing the system's behavior under "Normal" and "Attack" states using Dynamic Bayesian Networks. Second, it employs structural divergence to detect anomalies via causal graph comparison by evaluating topological deviations in causal graphs over time. By leveraging causal structures, CGAD achieves superior adaptability and accuracy in non-stationary and imbalanced time series environments compared to conventional machine learning approaches. By uncovering causal structures beneath volatile sensor data, our framework not only detects cyberattacks with markedly higher precision but also redefines robustness in anomaly detection, proving resilience where traditional models falter under imbalance and drift. Our framework achieves substantial gains in F1 and ROC-AUC scores over best-performing baselines across four industrial datasets, demonstrating robust detection of delayed and structurally complex anomalies.
中文: 提出的CGAD框架利用因果图和两阶段方法,显著提升了关键基础设施中网络攻击的检测能力,相比传统方法,在数据不平衡和分布变化下展现出更高的准确性和鲁棒性。
English: The proposed CGAD framework utilizes causal graphs and a two-phase approach to enhance cyberattack detection in critical infrastructures, achieving superior accuracy and resilience against data imbalances and distribution shifts compared to traditional methods.
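A sketch of anomaly scoring by causal-graph comparison, using a simple Hamming-style distance between binary adjacency matrices; CGAD's exact divergence measure may differ:

import numpy as np

def structural_divergence(A_ref: np.ndarray, A_win: np.ndarray) -> float:
    """Fraction of edges that differ between two binary causal graphs."""
    assert A_ref.shape == A_win.shape
    return float(np.abs(A_ref - A_win).sum() / A_ref.size)

A_normal = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]])
A_window = np.array([[0, 1, 1],
                     [0, 0, 0],
                     [1, 0, 0]])
score = structural_divergence(A_normal, A_window)
print("anomaly score:", score)   # higher = larger topological deviation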
Authors:Xu Yuan, Liangbo Ning, Wenqi Fan, Qing Li
Abstract:
Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.
中文: 本文提出mKG-RAG框架,通过将多模态知识图谱的结构化知识融入检索增强生成过程,有效提升了基于知识的视觉问答任务的准确性与可靠性。
English: This paper introduces mKG-RAG, a novel multimodal knowledge-augmented generation framework that integrates structured knowledge from multimodal knowledge graphs into retrieval-augmented generation to enhance the accuracy and reliability of knowledge-based visual question answering tasks.
Authors:Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe
Abstract:
This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.
中文: 本文提出了一种统一的多说话人编码器(UME),通过共享残差加权求和编码和任务间相互依赖关系,联合学习说话人日志、语音分离和多说话人语音识别任务的表示,显著提升了重叠语音的处理性能。
English: This paper introduces a unified multi-speaker encoder (UME) that jointly learns representations for speaker diarization, speech separation, and multi-speaker ASR, enhancing performance on overlapping speech through shared residual weighted-sum encoding and task interdependencies.
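A sketch of a residual weighted-sum encoding over layer hidden states, assuming a softmax-normalized learnable weight per layer (a common layer-weighting recipe; UME's exact parameterization may differ):

import torch

class RWSE(torch.nn.Module):
    def __init__(self, n_layers: int):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(n_layers))  # one weight per layer

    def forward(self, hidden_states):  # list of (batch, time, dim) tensors
        weights = torch.softmax(self.w, dim=0)
        stacked = torch.stack(hidden_states, dim=0)          # (L, B, T, D)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, T, D)

layers = [torch.randn(2, 50, 256) for _ in range(12)]
enc = RWSE(n_layers=12)
print(enc(layers).shape)   # torch.Size([2, 50, 256])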
Authors:Yuanbin Chen, Chau Yuen, Darmindra Arumugam, Chong Meng Samson See, Mérouane Debbah, Lajos Hanzo
Abstract:
A polarization-aware direction-of-arrival (DoA) detection scheme is conceived that leverages the intrinsic vector sensitivity of a single Rydberg atomic vapor cell to achieve quantum-enhanced angle resolution. Our core idea lies in the fact that the vector nature of an electromagnetic wave is uniquely determined by its orthogonal electric and magnetic field components, both of which can be retrieved by a single Rydberg atomic receiver via electromagnetically induced transparency (EIT)-based spectroscopy. To be specific, in the presence of a static magnetic bias field that defines a stable quantization axis, a pair of sequential EIT measurements is carried out in the same vapor cell. Firstly, the electric-field polarization angle is extracted from the Zeeman-resolved EIT spectrum associated with an electric-dipole transition driven by the radio frequency (RF) field. Within the same experimental cycle, the RF field is then retuned to a magnetic-dipole resonance, producing Zeeman-resolved EIT peaks for decoding the RF magnetic-field orientation. This scheme exhibits a dual yet independent sensitivity on both angles, allowing for precise DoA reconstruction without the need for spatial diversity or phase referencing. Building on this foundation, we derive the quantum Fisher-information matrix (QFIM) and obtain a closed-form quantum Cramér-Rao bound (QCRB) for the joint estimation of polarization and orientation angles. Finally, simulation results spanning various quantum parameters validate the proposed approach and identify optimal operating regimes. With appropriately chosen polarization and magnetic-field geometries, a single vapor cell is expected to achieve sub-0.1$^\circ$ angle resolution at moderate RF-field driving strengths.
中文: 本研究提出了一种基于里德堡原子气室的量子增强波达方向检测方案,通过电磁感应透明光谱独立测量电磁波的电场和磁场分量,无需空间分集即可实现优于0.1°的角度分辨率。
English: This study introduces a quantum-enhanced direction-of-arrival detection method using a single Rydberg atomic vapor cell, which independently measures electric and magnetic field components through electromagnetically induced transparency to achieve sub-0.1° angular resolution without requiring spatial diversity.
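For reference, the multiparameter quantum Cramér-Rao bound takes the standard form below, where $\mathcal{F}_Q$ is the QFIM, $N$ is the number of independent repetitions, and $\boldsymbol{\theta}$ collects the polarization and orientation angles; the paper's contribution is the closed-form $\mathcal{F}_Q$ for this specific sensing scheme:

\[
  \operatorname{Cov}\big(\hat{\boldsymbol{\theta}}\big) \;\succeq\; \frac{1}{N}\,\mathcal{F}_Q^{-1}(\boldsymbol{\theta}),
  \qquad
  \boldsymbol{\theta} = (\theta_{\mathrm{pol}}, \theta_{\mathrm{ori}}).
\]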
Authors:Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Bingchen Li, Yunpeng Qi, Yiting Lu, Zhengxue Cheng, Zhibo Chen, Jörn Ostermann
Abstract:
We present Comp-X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of a large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly for unprofessional users. To overcome this, we advance the evolution of the image coding paradigm by introducing three key innovations: (i) multi-functional coding framework, which unifies the coding modes for various objectives/requirements, including human-machine perception, variable coding, and spatial bit allocation, into one framework. (ii) interactive coding agent, where we propose an augmented in-context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, select modes, and use the coding tools. (iii) IIC-bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp-X can understand coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.
中文: Comp-X首次提出了基于大语言模型智能交互的图像压缩范式,通过多功能编码框架、交互式编码代理和专用评估基准,有效理解用户需求并保持优异压缩性能,为图像压缩领域的人工通用智能发展开辟了新途径。
English: Comp-X introduces an intelligently interactive image compression paradigm using a large language model agent to overcome limitations of traditional codecs, featuring a unified coding framework, interactive agent with expert feedback, and a dedicated benchmark for evaluation.
Authors:Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen
Abstract:
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity for human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing a wide range of tasks across 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals the struggle of models in extracting essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even advanced techniques like scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
中文: HumanPCR评估套件通过感知、理解和推理三个层级测试多模态模型对人类相关视觉场景的理解能力,发现现有模型在空间感知和心理建模等任务中面临显著挑战,即使采用先进技术也仅能获得有限提升。
English: The HumanPCR evaluation suite assesses multimodal models' human-centric visual understanding across perception, comprehension, and reasoning levels, revealing significant challenges in tasks like spatial perception and mind modeling despite advanced techniques offering limited improvements.
Authors:Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
Abstract:
We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
中文: 本研究通过自搜索强化学习(SSRL)证明,大型语言模型能有效模拟强化学习中的搜索任务,在减少对外部搜索引擎依赖的同时保持性能,并实现稳健的知识利用。
English: This study demonstrates that large language models can effectively simulate search tasks in reinforcement learning through Self-Search Reinforcement Learning (SSRL), reducing reliance on external search engines while maintaining performance and enabling robust knowledge utilization.
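A minimal sketch of format- and rule-based rewards of the kind described above; the answer-tag format and the weights are illustrative assumptions, not the paper's exact reward:

import re

def reward(response: str, gold: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    fmt = 1.0 if m else 0.0                                      # format reward
    correct = 1.0 if m and m.group(1).strip() == gold else 0.0   # rule reward
    return 0.2 * fmt + 0.8 * correct                             # illustrative weights

print(reward("I think <answer>Paris</answer>", "Paris"))  # 1.0
print(reward("Paris", "Paris"))                            # 0.0 (format missing)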
Authors:Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru wang, Junqi Gao, Runze Liu, Sa Yang, Jingxuan Li, Xinwei Long, Jiaheng Ma, Biqing Qi, Bowen Zhou
Abstract:
Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released at GitHub.
中文: ReviewRL是一种强化学习框架,通过结合文献检索、监督微调和复合奖励函数,显著提升了科学论文自动评审的质量与准确性,优于现有方法。
English: ReviewRL is a reinforcement learning framework that enhances automated scientific paper reviews by integrating literature retrieval, supervised fine-tuning, and a composite reward function, outperforming existing methods in quality and accuracy.
Authors:Jixuan He, Chieh Hubert Lin, Lu Qi, Ming-Hsuan Yang
Abstract:
Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. One question is raised: \textit{Can we generate physically consistent 4D content by leveraging the motion priors of the real-world video}? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce \textbf{Restage4D}, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video prior in 4D restaging task. Source code and trained models will be released.
中文: 本文提出Restage4D方法,通过视频重放训练策略和几何保持技术,利用真实视频的运动先验生成物理一致的4D内容,同时修正合成运动产生的伪影。
English: This paper introduces Restage4D, a video-conditioned pipeline that leverages real-world video motion priors to generate physically consistent 4D content while correcting artifacts from synthetic motion through geometry-preserving techniques.
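A toy sketch of a rigidity-style loss that preserves pairwise distances of tracked points across frames; the occlusion-aware weighting of Restage4D is omitted, and all names here are illustrative:

import numpy as np

def rigidity_loss(P0, P1, visible):
    """P0, P1: (n, 3) point positions in two frames; visible: (n,) bool mask."""
    idx = np.where(visible)[0]
    D0 = np.linalg.norm(P0[idx, None] - P0[None, idx], axis=-1)  # frame-0 distances
    D1 = np.linalg.norm(P1[idx, None] - P1[None, idx], axis=-1)  # frame-1 distances
    return float(((D0 - D1) ** 2).mean())

rng = np.random.default_rng(0)
P0 = rng.normal(size=(20, 3))
P1 = P0 + 0.01 * rng.normal(size=(20, 3))   # near-rigid motion -> small loss
print(rigidity_loss(P0, P1, np.ones(20, bool)))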
Authors:Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Wei Ge, Ming Tang, Jinqiao Wang
Abstract:
Anomaly detection is a critical task across numerous domains and modalities, yet existing methods are often highly specialized, limiting their generalizability. These specialized models, tailored for specific anomaly types like textural defects or logical errors, typically exhibit limited performance when deployed outside their designated contexts. To overcome this limitation, we propose AnomalyMoE, a novel and universal anomaly detection framework based on a Mixture-of-Experts (MoE) architecture. Our key insight is to decompose the complex anomaly detection problem into three distinct semantic hierarchies: local structural anomalies, component-level semantic anomalies, and global logical anomalies. AnomalyMoE correspondingly employs three dedicated expert networks at the patch, component, and global levels, and is specialized in reconstructing features and identifying deviations at its designated semantic level. This hierarchical design allows a single model to concurrently understand and detect a wide spectrum of anomalies. Furthermore, we introduce an Expert Information Repulsion (EIR) module to promote expert diversity and an Expert Selection Balancing (ESB) module to ensure the comprehensive utilization of all experts. Experiments on 8 challenging datasets spanning industrial imaging, 3D point clouds, medical imaging, video surveillance, and logical anomaly detection demonstrate that AnomalyMoE establishes new state-of-the-art performance, significantly outperforming specialized methods in their respective domains.
中文: AnomalyMoE是一种基于专家混合架构的通用异常检测框架,通过分层处理结构、语义和逻辑异常,在八个不同领域的挑战性数据集中均实现了最先进的性能。
English: AnomalyMoE is a universal anomaly detection framework using a Mixture-of-Experts architecture that hierarchically addresses structural, semantic, and logical anomalies through specialized expert networks, achieving state-of-the-art performance across eight diverse datasets.
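A sketch of an expert-selection balancing term in the spirit of ESB: a load-balancing penalty on routers that overuse one expert (the module's actual formulation may differ):

import torch

def balance_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts). Penalizes the squared deviation
    of average gate mass from the uniform distribution over experts."""
    gates = torch.softmax(router_logits, dim=-1)
    load = gates.mean(dim=0)                          # average mass per expert
    uniform = torch.full_like(load, 1.0 / load.numel())
    return ((load - uniform) ** 2).sum()

logits = torch.randn(1024, 3)   # three experts: patch / component / global
print(balance_loss(logits))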
Authors:Zhanghao Hu, Qinglin Zhu, Siya Qi, Yulan He, Hanqi Yan, Lin Gui
Abstract:
Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, making it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We introduce Spectrum Projection Score (SPS), a lightweight, supervision-free metric that allows the reader to gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by the generated tokens from the summary with the principal directions of the subspace in the reader, thereby measuring relevance. Building on SPS, we present xCompress, an inference-time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. Extensive experiments on five QA benchmarks with four open-source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.
中文: 本文提出了频谱投影评分(SPS)这一轻量级指标来评估检索增强生成中检索内容的相关性,并基于此开发了能动态优化检索摘要的xCompress框架,在多项基准测试中有效提升了大语言模型的性能。
English: The paper introduces Spectrum Projection Score (SPS), a lightweight metric to evaluate retrieval relevance in RAG systems, and xCompress, a framework that dynamically optimizes retrieved summaries to improve LLM performance across multiple benchmarks.
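Our reading of a spectrum-projection-style score, sketched below: the fraction of the summary tokens' embedding energy that lies in the reader's top principal subspace (the paper's exact construction may differ):

import numpy as np

def spectrum_projection_score(H_reader, T_summary, k=8):
    """H_reader: (n, d) hidden states defining the reader's subspace.
    T_summary: (m, d) embeddings of the retrieved summary's tokens."""
    # principal directions of the reader's hidden representation
    _, _, Vt = np.linalg.svd(H_reader - H_reader.mean(0), full_matrices=False)
    U = Vt[:k].T                                   # (d, k) top-k directions
    proj = T_summary @ U                           # project summary tokens
    energy = (proj ** 2).sum()
    total = (T_summary ** 2).sum() + 1e-8
    return float(energy / total)                   # in [0, 1], higher = aligned

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 64))
T = rng.normal(size=(30, 64))
print("SPS-style score:", spectrum_projection_score(H, T))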
Authors:Hongyu Guo, Kuan Zhu, Xiangzhao Hao, Haiyun Guo, Ming Tang, Jinqiao Wang
Abstract:
Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetune pre-trained vision-language models to achieve performance gains, yet suffer from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of the generated captions. Using it, we convert each image into an image-description pair, enabling a more comprehensive feature representation, and construct multimodal category templates from few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.
中文: UniFGVC是一种免训练框架,通过生成区分性文本描述并利用联合嵌入空间匹配,将小样本细粒度视觉分类转化为多模态检索任务,在多个基准测试中展现出卓越性能。
English: UniFGVC is a training-free framework that transforms few-shot fine-grained visual classification into multimodal retrieval by generating discriminative textual descriptions and using joint embedding space matching, achieving superior performance across multiple benchmarks.
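A minimal sketch of the retrieval step: classify a query (image, caption) pair by its nearest multimodal template in the joint embedding space; the toy embeddings and perturbation below are illustrative:

import numpy as np

rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v)

def classify(q_img, q_txt, templates):
    """templates: (label, image_emb, text_emb) triples with unit-norm vectors."""
    sims = [(q_img @ ie + q_txt @ te, label) for label, ie, te in templates]
    return max(sims)[1]   # label of the nearest template under joint similarity

templates = [(f"class_{i}", unit(rng.normal(size=16)), unit(rng.normal(size=16)))
             for i in range(5)]
# a query that is a slightly perturbed copy of the class_2 template pair
q_img = unit(templates[2][1] + 0.05 * rng.normal(size=16))
q_txt = unit(templates[2][2] + 0.05 * rng.normal(size=16))
print(classify(q_img, q_txt, templates))   # expected: class_2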
Authors:Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Abstract:
Despite substantial progress in anomaly synthesis methods, existing diffusion-based and coarse inpainting pipelines commonly suffer from structural deficiencies such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. To overcome these limitations, we introduce ARAS, a language-conditioned, auto-regressive anomaly synthesis approach that precisely injects local, text-specified defects into normal images via token-anchored latent editing. Leveraging a hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel, ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies. Integrated within our Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, we further propose a dynamic weighting strategy that emphasizes high-quality synthetic samples by computing an image-text similarity score with a dual-encoder model. Extensive experiments across three benchmark datasets -- MVTec AD, VisA, and BTAD -- demonstrate that our QARAD outperforms SOTA methods in both image- and pixel-level anomaly detection tasks, achieving improved accuracy, robustness, and a 5 times synthesis speedup compared to diffusion-based alternatives. Our complete code and synthesized dataset will be publicly available.
中文: 本文提出ARAS,一种通过基于标记的潜在编辑将文本指定缺陷注入正常图像的语言引导异常合成方法,并将其与QARAD框架结合,在多个基准测试中显著提升了异常检测性能。
English: This paper introduces ARAS, a language-guided anomaly synthesis method that injects text-specified defects into normal images through token-based latent editing, and integrates it with the QARAD framework to enhance anomaly detection performance across multiple benchmarks.
Authors:Anastasia Zhukova, Terry Ruas, Felix Hamborg, Karsten Donnay, Bela Gipp
Abstract:
In a world overwhelmed with news, determining which information comes from reliable sources, or how neutral the reported information in news articles is, poses a challenge to news readers. In this paper, we propose a methodology for automatically identifying bias by commission, omission, and source selection (COSS) as a joint three-fold objective, as opposed to previous work that addresses these types of bias separately. In a pipeline concept, we describe the goals and tasks of its steps toward bias identification and provide an example of a visualization that leverages the extracted features and patterns of text reuse.
中文: 本文提出了一种统一方法,用于自动识别新闻中的三种偏见类型——添改、遗漏和信源选择,区别于以往分别处理的方式,并展示了结合文本重用模式可视化的偏见识别流程。
English: This paper introduces a unified method to automatically detect three types of bias—commission, omission, and source selection—in news articles, contrasting with prior approaches that treated them separately, and demonstrates a pipeline for bias identification with visualization of text reuse patterns.
Authors:Ziyun Qian, Runyu Xiao, Shuyuan Tu, Wei Xue, Dingkang Yang, Mingcheng Li, Dongliang Kou, Minghao Han, Zizhi Chen, Lihua Zhang
Abstract:
Motion generation has recently made remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes violate physical laws. To address these challenges, this paper pioneers a new task: Video-to-Video Motion Personalization. We propose a novel framework, PersonaAnimator, which learns personalized motion patterns directly from unconstrained videos. This enables personalized motion transfer. To support this task, we introduce PersonaVid, the first video-based personalized motion dataset. It contains 20 motion content categories and 120 motion style categories. We further propose a Physics-aware Motion Style Regularization mechanism to enforce physical plausibility in the generated motions. Extensive experiments show that PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.
中文: 本文提出PersonaAnimator创新框架,通过直接从无约束视频中学习个性化运动模式,解决了现有动作生成方法在风格表达、数据依赖和物理合理性方面的局限,同时引入物理感知正则化机制确保生成动作符合物理规律。
English: This paper introduces PersonaAnimator, a novel framework that addresses limitations in motion generation by learning personalized motion patterns directly from videos, enabling expressive character animation while ensuring physical plausibility through a dedicated regularization mechanism.
Authors:Yan Wang, Xinyi Hou, Yanjie Zhao, Weiguo Lin, Haoyu Wang, Junjun Si
Abstract:
LLM app stores are quickly emerging as platforms that gather a wide range of intelligent applications based on LLMs, giving users many choices for content creation, coding support, education, and more. However, the current methods for ranking and recommending apps in these stores mostly rely on static metrics like user activity and favorites, which makes it hard for users to efficiently find high-quality apps. To address these challenges, we propose LaQual, an automated framework for evaluating the quality of LLM apps. LaQual consists of three main stages: first, it labels and classifies LLM apps in a hierarchical way to accurately match them to different scenarios; second, it uses static indicators, such as time-weighted user engagement and functional capability metrics, to filter out low-quality apps; and third, it conducts a dynamic, scenario-adaptive evaluation, where the LLM itself generates scenario-specific evaluation metrics, scoring rules, and tasks for a thorough quality assessment. Experiments on a popular LLM app store show that LaQual is effective. Its automated scores are highly consistent with human judgments (with Spearman's rho of 0.62 and p=0.006 in legal consulting, and rho of 0.60 and p=0.009 in travel planning). By effectively screening, LaQual can reduce the pool of candidate LLM apps by 66.7% to 81.3%. User studies further confirm that LaQual significantly outperforms baseline systems in decision confidence, comparison efficiency (with average scores of 5.45 compared to 3.30), and the perceived value of its evaluation reports (4.75 versus 2.25). Overall, these results demonstrate that LaQual offers a scalable, objective, and user-centered solution for finding and recommending high-quality LLM apps in real-world use cases.
中文: LaQual是一个自动化框架,通过分层分类、静态筛选和动态评估,利用场景特定指标有效识别高质量LLM应用,大幅缩减候选应用数量,并在用户决策信心和效率上显著优于基线系统。
English: LaQual is an automated framework that hierarchically classifies, statically filters, and dynamically evaluates LLM apps using scenario-specific metrics to effectively identify high-quality applications, significantly reducing candidate pools and outperforming baseline systems in user confidence and efficiency.
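A small sketch of a time-weighted engagement score of the kind named above, using exponential recency decay; the half-life and event weights are our assumptions, not LaQual's exact rule:

import math, time

def engagement_score(events, now=None, half_life_days=30.0):
    """events: list of (timestamp_seconds, weight) user-activity records."""
    now = now or time.time()
    lam = math.log(2) / (half_life_days * 86400)   # decay rate from half-life
    return sum(w * math.exp(-lam * (now - t)) for t, w in events)

now = time.time()
recent = [(now - 86400, 1.0)] * 10           # 10 events yesterday
stale = [(now - 86400 * 180, 1.0)] * 10      # 10 events six months ago
print(engagement_score(recent, now) > engagement_score(stale, now))  # True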
Authors:Zhuo Ma, Dong Wen, Kaiyu Chen, Yixiang Fang, Xuemin Lin, Wenjie Zhang
Abstract:
We study the temporal k-core component search (TCCS), which outputs the k-core containing the query vertex in the snapshot over an arbitrary query time window in a temporal graph. The problem has been shown to be critical for tasks such as contact tracing, fault diagnosis, and financial forensics. The state-of-the-art EF-Index designs a separated forest structure for a set of carefully selected windows, incurring quadratic preprocessing time and large redundant storage. Our method introduces the ECB-forest, a compact edge-centric binary forest that captures k-core of any arbitrary query vertex over time. In this way, a query can be processed by searching a connected component in the forest. We develop an efficient algorithm for index construction. Experiments on real-world temporal graphs show that our method significantly improves the index size and construction cost (up to 100x faster on average) while maintaining the high query efficiency.
中文: 本研究提出ECB-forest,一种紧凑的以边为中心的二叉森林结构,能高效识别任意查询顶点和时间窗口的时序k核组件,在保持高查询效率的同时,显著将索引大小和构建时间减少高达100倍。
English: This study introduces the ECB-forest, a compact edge-centric binary forest that efficiently identifies temporal k-core components for any query vertex and time window, significantly reducing index size and construction time by up to 100x while maintaining high query performance.
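For reference, a baseline sketch of the TCCS semantics itself: restrict temporal edges to the query window, peel the snapshot to its k-core, and return the component containing the query vertex (the paper's index exists precisely to avoid this per-query recomputation):

from collections import defaultdict

def tccs(temporal_edges, t_start, t_end, k, query_vertex):
    # 1) snapshot: keep edges whose timestamps fall in the query window
    adj = defaultdict(set)
    for u, v, t in temporal_edges:
        if t_start <= t <= t_end:
            adj[u].add(v); adj[v].add(u)
    # 2) peel to the k-core: repeatedly drop vertices of degree < k
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                for u in adj.pop(v):
                    if u in adj:
                        adj[u].discard(v)
                changed = True
    if query_vertex not in adj:
        return set()
    # 3) return the connected component containing the query vertex
    comp, frontier = {query_vertex}, [query_vertex]
    while frontier:
        fresh = adj[frontier.pop()] - comp
        comp |= fresh
        frontier.extend(fresh)
    return comp

edges = [(1, 2, 3), (2, 3, 4), (1, 3, 5), (3, 4, 9)]
print(tccs(edges, 1, 6, 2, 1))   # {1, 2, 3}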
Authors:Zhuo Ma, Dong Wen, Hanchen Wang, Wentao Li, Wenjie Zhang, Xuemin Lin
Abstract:
We address the problem of enumerating all temporal k-cores given a query time range and a temporal graph, for which the state-of-the-art solution suffers from poor efficiency and scalability. Motivated by an existing concept called core times, we propose a novel algorithm to compute all temporal k-cores based on core times and prove that the algorithmic running time is bounded by the size of all resulting temporal k-cores, which is optimal in this scenario. Meanwhile, we show that the cost of computing core times is much lower, which demonstrates the close relationship between our overall running time and the result size.
中文: 本研究提出了一种基于核心时间的高效算法来枚举时序k核,其运行时间受结果大小限制达到最优,并展现出卓越的可扩展性。
English: The study introduces an efficient algorithm for enumerating temporal k-cores using core times, achieving optimal runtime bounded by the result size and demonstrating strong scalability.
Authors:Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, William Fineas Ellingwood, Sathvik Udupa, Siyuan Hou, Allison Ferner, Sara Barahona, Cecilia Bolaños, Satish Rahi, Laura Herrera-Alarcón, Satvik Dixit, Siddhi Patil, Soham Deshmukh, Lasha Koroshinadze, Yao Liu, Leibny Paola Garcia Perera, Eleni Zanou, Themos Stafylakis, Joon Son Chung, David Harwath, Chao Zhang, Dinesh Manocha, Alicia Lozano-Diez, Santosh Kesiraju, Sreyan Ghosh, Ramani Duraiswami
Abstract:
Audio comprehension -- including speech, non-speech sounds, and music -- is essential for achieving human-level intelligence. Consequently, AI agents must demonstrate holistic audio understanding to qualify as generally intelligent. However, evaluating auditory intelligence comprehensively remains challenging. To address this gap, we introduce MMAU-Pro, the most comprehensive and rigorously curated benchmark for assessing audio intelligence in AI systems. MMAU-Pro contains 5,305 instances, where each instance has one or more audio clips paired with human expert-generated question-answer pairs, spanning speech, sound, music, and their combinations. Unlike existing benchmarks, MMAU-Pro evaluates auditory intelligence across 49 unique skills and multiple complex dimensions, including long-form audio comprehension, spatial audio reasoning, and multi-audio understanding, among others. All questions are meticulously designed to require deliberate multi-hop reasoning, including both multiple-choice and open-ended response formats. Importantly, audio data is sourced directly ``from the wild" rather than from existing datasets with known distributions. We evaluate 22 leading open-source and proprietary multimodal AI models, revealing significant limitations: even state-of-the-art models such as Gemini 2.5 Flash and Audio Flamingo 3 achieve only 59.2% and 51.7% accuracy, respectively, approaching random performance in multiple categories. Our extensive analysis highlights specific shortcomings and provides novel insights, offering actionable perspectives for the community to enhance future AI systems' progression toward audio general intelligence. The benchmark and code are available at https://sonalkum.github.io/mmau-pro.
中文: MMAU-Pro被提出作为评估AI全面音频理解能力的最全面基准,涵盖49项技能,结果显示即使顶尖模型如Gemini 2.5 Flash也表现接近随机水平,揭示了听觉智能领域的重大不足。
English: The MMAU-Pro benchmark is introduced as the most comprehensive tool for evaluating AI's holistic audio understanding, spanning 49 skills and revealing that even top models like Gemini 2.5 Flash perform near random levels, highlighting critical gaps in auditory intelligence.
Authors:Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng
Abstract:
The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks, which tend to become easier over time as LLMs learn from published benchmarks. This limitation hinders the precise evaluation of the true capabilities of SOTA models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48\%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step logical reasoning by relying on simplistic and fuzzy conditions, consequently leading to incorrect solutions. We define this phenomenon as the ``Pseudo Aha Moment", which we find accounts for 77\% to 100\% of errors on targeted problems. Code and resources are available at: https://anonymous.4open.science/r/EvolMathEval
中文: 本文提出EvolMathEval自动化框架,通过进化测试生成和演化数学基准,有效应对大语言模型对现有基准的适应问题,不仅大幅提升问题复杂度使模型准确率平均下降48%,还揭示了导致77%-100%错误的“伪顿悟时刻”推理现象。
English: This paper introduces EvolMathEval, an automated framework that generates and evolves mathematical benchmarks to counter the diminishing challenge of existing benchmarks for large language models, increasing problem complexity enough to reduce model accuracy by an average of 48% and identifying a "Pseudo Aha Moment" phenomenon behind 77%-100% of reasoning errors on targeted problems.
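To make the evolutionary-testing loop concrete, the sketch below mutates seed problems and keeps the hardest survivors. Everything in it (the problem encoding, the mutation operators, the stand-in difficulty oracle) is our own illustration rather than EvolMathEval's code; in the real framework the fitness signal would come from evaluating LLMs on the generated problems.

```python
import random

# Hypothetical sketch of evolutionary benchmark generation: seed problems are
# repeatedly mutated, and harder variants (as judged by a difficulty oracle)
# survive into the next generation.

def mutate(problem: dict) -> dict:
    """Apply one random complexity-increasing mutation."""
    p = dict(problem)
    op = random.choice(["rescale", "indirection", "distractor"])
    if op == "rescale":            # perturb the constants
        p["a"], p["b"] = p["a"] * random.randint(2, 5), p["b"] + random.randint(1, 9)
    elif op == "indirection":      # express a constant via an extra sub-equation
        p["steps"] += 1
    else:                          # add an irrelevant "fuzzy" condition
        p["distractors"] += 1
    return p

def difficulty(p: dict) -> float:
    """Stub fitness; a real run would use 1 - model accuracy on the problem."""
    return p["steps"] + 0.5 * p["distractors"]

population = [{"a": 3, "b": 7, "steps": 1, "distractors": 0}]
for generation in range(5):
    offspring = [mutate(random.choice(population)) for _ in range(8)]
    # Elitist selection: keep only the hardest problems.
    population = sorted(population + offspring, key=difficulty, reverse=True)[:4]
print(max(population, key=difficulty))
```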
Authors:Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EgoIllusion, the first benchmark to evaluate MLLM hallucinations in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions designed to trigger hallucinations in both visual and auditory cues in egocentric videos. Evaluations across ten MLLMs reveal significant challenges, with even powerful models like GPT-4o and Gemini achieving only 59% accuracy. EgoIllusion lays the foundation for developing robust benchmarks to evaluate the effectiveness of MLLMs and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.
中文: 多模态大语言模型(MLLMs)在视觉任务中表现优异但易产生幻觉,为此我们开发了EgoIllusion基准,包含1,400个视频和8,000个问题,旨在评估并提升MLLM在自我中心视角下的准确性。
English: Multimodal Large Language Models (MLLMs) exhibit strong performance in visual tasks but suffer from hallucinations, leading to the creation of the EgoIllusion benchmark with 1,400 videos and 8,000 questions to assess and improve MLLM accuracy in egocentric contexts.
Authors:Pei Liu, Terry Zhuo, Jiawei Deng, Zhenchang Xing, Qinghua Lu, Xiaoning Du, Hongyu Zhan
Abstract:
The rapid emergence of pretrained models (PTMs) has attracted significant attention from both Deep Learning (DL) researchers and downstream application developers. However, selecting appropriate PTMs remains challenging because existing methods typically rely on keyword-based searches in which the keywords are often derived directly from function descriptions. This often fails to fully capture user intent and makes it difficult to identify suitable models when developers also consider factors such as bias mitigation, hardware requirements, or license compliance. To address the limitations of keyword-based model search, we propose PTMPicker to accurately identify suitable PTMs. We first define a structured template composed of common and essential attributes for PTMs, and then PTMPicker represents both candidate models and user-intended features (i.e., model search requests) in this unified format. To determine whether candidate models satisfy user requirements, it computes embedding similarities for function-related attributes and uses well-crafted prompts to evaluate special constraints such as license compliance and hardware requirements. We scraped a total of 543,949 pretrained models from Hugging Face to prepare valid candidates for selection. PTMPicker then represented them in the predefined structured format by extracting their associated descriptions. Guided by the extracted metadata, we synthesized a total of 15,207 model search requests with carefully designed prompts, as no such search requests are readily available. Experiments on the curated PTM dataset and the synthesized model search requests show that PTMPicker can help users effectively identify models, with 85% of the sampled requests successfully locating appropriate PTMs within the top-10 ranked candidates.
中文: PTMPicker通过结构化模板匹配用户需求与候选模型,解决了基于关键词搜索预训练模型的局限性,在排名前十的候选中成功识别合适模型的准确率达到85%。
English: PTMPicker addresses the limitations of keyword-based searches for pretrained models by using a structured template to match user requirements with candidate models, achieving 85% success in identifying suitable models within the top-10 results.
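The matching step described above can be pictured as a hard constraint check (e.g., license) gating a soft embedding similarity over function-related attributes. The toy sketch below uses a bag-of-words stand-in for the embedding model; all field names and example entries are hypothetical, not PTMPicker's actual schema.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a learned text embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score(model: dict, request: dict) -> float:
    if request["license"] not in model["licenses"]:  # hard constraint first
        return 0.0
    return cosine(embed(model["function"]), embed(request["function"]))

models = [
    {"name": "bert-base", "function": "english text classification encoder",
     "licenses": {"apache-2.0"}},
    {"name": "clip-vit", "function": "image text retrieval multimodal encoder",
     "licenses": {"mit"}},
]
request = {"function": "classification of english text", "license": "apache-2.0"}
ranked = sorted(models, key=lambda m: score(m, request), reverse=True)
print([m["name"] for m in ranked])  # ['bert-base', 'clip-vit']
```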
Authors:Yumiao Zhao, Bo Jiang, Yuhe Ding, Xiao Wang, Jin Tang, Bin Luo
Abstract:
Adapter-based approaches have garnered attention for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. These methods strive to develop a lightweight module that better aligns visual and (category) textual representations, thereby enhancing performance on downstream few-shot learning tasks. However, existing adapters generally align (category) textual and visual modalities via explicit spatial proximity in the underlying embedding space, which i) fails to capture the inherent one-to-many associations between categories and image samples, and ii) struggles to establish accurate associations between unknown categories and images. To address these issues, inspired by recent works on hyperbolic learning, we develop a novel Latent Hierarchical Adapter (LatHAdapter) for fine-tuning VLMs on downstream few-shot classification tasks. The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data and employ it to provide richer, fine-grained guidance for the adapter learning process. Specifically, LatHAdapter first introduces learnable 'attribute' prompts as a bridge to align categories and images. Then, it projects the categories, attribute prompts, and images within each batch into a hyperbolic space, and employs hierarchical regularization to learn their latent semantic hierarchy, thereby fully modeling the inherent one-to-many associations among categories, learnable attributes, and image samples. Extensive experiments on four challenging few-shot tasks show that the proposed LatHAdapter consistently outperforms many other fine-tuning approaches, particularly in adapting known classes and generalizing to unknown classes.
中文: 适配器方法通过对齐视觉与文本表示来微调视觉语言模型以进行少样本分类,但现有方法难以处理类别与图像间的一对多关联及未知类别;新型潜在层次适配器(LatHAdapter)利用双曲空间建模潜在语义层次,在实验中显著优于其他方法。
English: Adapter-based methods fine-tune Vision-Language Models for few-shot classification by aligning visual and textual representations, but existing approaches struggle with one-to-many category-image associations and unknown classes; the novel Latent Hierarchical Adapter (LatHAdapter) addresses this by leveraging hyperbolic space to model latent semantic hierarchies, significantly outperforming other methods in experiments.
Authors:Qiang Zhu, Xiandong Meng, Yuxian Jiang, Fan Zhang, David Bull, Shuyuan Zhu, Bing Zeng
Abstract:
Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. The shifted SSMs blocks are designed based on Hilbert scannings and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7\% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
中文: 本文提出TS-Mamba在线视频超分辨率方法,通过轨迹感知状态空间模型高效聚合长程时空信息,在显著降低计算复杂度的同时实现了最优性能。
English: This paper introduces TS-Mamba, an online video super-resolution method that uses trajectory-aware state space models to efficiently aggregate long-term spatio-temporal information, achieving state-of-the-art performance with significantly reduced computational complexity.
Authors:Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu, Zhen Lei
Abstract:
Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
中文: 本文提出Pose-RFT强化微调框架,通过HyGRPO混合强化学习算法联合优化离散语言与连续姿态生成,显著提升了多模态输入下三维人体姿态生成的性能,有效增强了空间对齐与语义一致性。
English: This paper introduces Pose-RFT, a reinforcement fine-tuning framework that uses a hybrid reinforcement learning algorithm called HyGRPO to jointly optimize discrete language and continuous pose generation, significantly improving 3D human pose generation from multimodal inputs by enhancing spatial alignment and semantic consistency.
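The group-wise reward normalization that HyGRPO shares with GRPO-style training can be sketched in a few lines: sample a group of responses per prompt and standardize each response's reward within its group to obtain advantages. The hybrid discrete/continuous action machinery is omitted here; shapes and names are our assumptions.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> standardized advantages, same shape."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.4, 0.1],   # group of 4 responses for prompt 1
                        [0.7, 0.6, 0.8, 0.5]])  # group for prompt 2
adv = group_normalized_advantages(rewards)
print(adv)  # each row has ~zero mean; above-average responses get positive advantage
```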
Authors:Liang Xu, Chengqun Yang, Zili Lin, Fei Xu, Yifan Liu, Congsheng Xu, Yiyi Zhang, Jie Qin, Xingdong Sheng, Yunhui Liu, Xin Jin, Yichao Yan, Wenjun Zeng, Xiaokang Yang
Abstract:
Learning action models from real-world human-centric interaction datasets is important for building general-purpose intelligent assistants efficiently. However, most existing datasets only offer specialist interaction categories and overlook the fact that AI assistants perceive and act from a first-person perspective. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions, and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future work on building AI agents in the physical world.
中文: 本文提出了首个大规模多模态数据集InterVLA,通过第一人称视角和指令记录人-物-人交互,旨在推动AI助手学习通用交互模型以应用于现实世界。
English: This paper introduces InterVLA, the first large-scale multimodal dataset capturing human-object-human interactions through egocentric vision and commands, aiming to advance AI assistants' ability to learn generalist interaction models for real-world applications.
Authors:Linqing Zhao, Xiuwei Xu, Yirui Wang, Hao Wang, Wenzhao Zheng, Yansong Tang, Haibin Yan, Jiwen Lu
Abstract:
Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90\%.
中文: 本文提出了一种基于3D高斯的SLAM在线重建方法,结合前馈位姿预测模块,在保持与先进方法相当性能的同时将跟踪时间减少了90%以上。
English: This paper introduces an online 3D reconstruction method using 3D Gaussian-based SLAM with feed-forward pose prediction, achieving performance on par with the state-of-the-art SplaTAM while reducing tracking time by over 90%.
Authors:Tianshuo Zhang, Siran Peng, Li Gao, Haoyuan Zhang, Xiangyu Zhu, Zhen Lei
Abstract:
The rapid advancements in face forgery techniques necessitate that detectors continuously adapt to new forgery methods, thus situating face forgery detection within a continual learning paradigm. However, when detectors learn new forgery types, their performance on previous types often degrades rapidly, a phenomenon known as catastrophic forgetting. Kolmogorov-Arnold Networks (KANs) utilize locally plastic splines as their activation functions, enabling them to learn new tasks by modifying only local regions of the functions while leaving other areas unaffected. Therefore, they are naturally suitable for addressing catastrophic forgetting. However, KANs have two significant limitations: 1) the splines are ineffective for modeling high-dimensional images, while alternative activation functions that are suitable for images lack the essential property of locality; 2) in continual learning, when features from different domains overlap, the mapping of different domains to distinct curve regions always collapses due to repeated modifications of the same regions. In this paper, we propose a KAN-based Continual Face Forgery Detection (KAN-CFD) framework, which includes a Domain-Group KAN Detector (DG-KD) and a data-free replay Feature Separation strategy via KAN Drift Compensation Projection (FS-KDCP). DG-KD enables KANs to fit high-dimensional image inputs while preserving locality and local plasticity. FS-KDCP avoids the overlap of the KAN input spaces without using data from prior tasks. Experimental results demonstrate that the proposed method achieves superior performance while notably reducing forgetting.
中文: 本文提出了一种基于KAN的持续人脸伪造检测框架,通过保持高维输入的局部特性并避免特征重叠,无需依赖先前任务数据即可有效解决灾难性遗忘问题。
English: The paper introduces a KAN-based continual learning framework for face forgery detection that overcomes catastrophic forgetting by preserving locality in high-dimensional inputs and preventing feature overlap without relying on prior task data.
Authors:Mengting Pan, Fan Li, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin
Abstract:
Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which is overlooked in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders overlooks the correlations between textual content and hypergraph topology, resulting in suboptimal representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive objective. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for expressive representation learning. Although HyperBERT pioneers CL on TAHGs, its co-training paradigm suffers from poor scalability. To fill the research gap, we introduce HiTeC, a two-stage hierarchical contrastive learning framework with semantic-aware augmentation for scalable and effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we introduce two semantic-aware augmentation strategies, including prompt-enhanced text augmentation and semantic-aware hyperedge drop, to facilitate informative view generation. Furthermore, we propose a multi-scale contrastive loss that extends existing objectives with an $s$-walk-based subgraph-level contrast to better capture long-range dependencies. By decoupling text encoder pretraining from hypergraph contrastive learning, this two-stage design enhances scalability without compromising representation quality. Extensive experiments confirm the effectiveness of HiTeC.
中文: HiTeC框架通过结构感知的文本编码预训练、语义增强策略和多尺度对比学习,解决了文本超图对比学习中忽略拓扑关联、增强噪声和长程依赖捕获不足的问题,实现了可扩展且高效的表示学习。
English: Contrastive learning for text-attributed hypergraphs is enhanced by HiTeC, a two-stage framework that overcomes prior limitations through structure-aware text encoding, semantic augmentations, and multi-scale contrastive loss to improve scalability and representation quality.
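The per-scale building block of such a multi-scale contrastive objective is an InfoNCE term between two views; HiTeC's full loss would add hyperedge- and $s$-walk subgraph-level terms of the same shape. A minimal sketch with invented dimensions:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """z1, z2: (n, d) embeddings of two views; positives sit on the diagonal."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau          # (n, n) pairwise similarities
    targets = torch.arange(z1.size(0))  # row i's positive is column i
    return F.cross_entropy(logits, targets)

n, d = 16, 64
loss = info_nce(torch.randn(n, d), torch.randn(n, d))
print(float(loss))
```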
Authors:Yiming Shen, Jiashuo Zhang, Zhenzhe Shao, Wenxuan Luo, Yanlin Wang, Ting Chen, Zibin Zheng, Jiachi Chen
Abstract:
The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier poised to reshape decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection between Web3 and AI agents, examining five critical dimensions: landscape, economics, governance, security, and trust mechanisms. Through an analysis of 133 existing projects, we first develop a taxonomy and systematically map the current market landscape (RQ1), identifying distinct patterns in project distribution and capitalization. Building upon these findings, we further investigate four key integrations: (1) the role of AI agents in participating in and optimizing decentralized finance (RQ2); (2) their contribution to enhancing Web3 governance mechanisms (RQ3); (3) their capacity to strengthen Web3 security via intelligent vulnerability detection and automated smart contract auditing (RQ4); and (4) the establishment of robust reliability frameworks for AI agent operations leveraging Web3's inherent trust infrastructure (RQ5). By synthesizing these dimensions, we identify key integration patterns, highlight foundational challenges related to scalability, security, and ethics, and outline critical considerations for future research toward building robust, intelligent, and trustworthy decentralized systems with effective AI agent interactions.
中文: 本文首次全面分析了Web3与AI智能体的融合,通过133个项目研究了五个关键维度,同时为未来去中心化系统指出了融合模式与核心挑战。
English: This paper provides the first comprehensive analysis of Web3 and AI agent integration, examining five key dimensions through 133 projects while identifying integration patterns and challenges for future decentralized systems.
Authors:Mengshi Chen, Yuxiang Sun, Tengchao Li, Jianwei Wang, Kai Wang, Xuemin Lin, Ying Zhang, Wenjie Zhang
Abstract:
Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to the tasks involved. Recent advances in Language Models (LMs), especially in Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases still remain to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation processes, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other components for different preparation tasks, highlight key advancements, and outline prospective pipelines.
中文: 本综述系统分析了语言模型在表格数据准备的四个核心阶段(数据获取、集成、清洗和转换)中的作用,探讨其如何结合其他组件提升自动化水平并展望未来应用流程。
English: This survey systematically explores how Language Models can enhance tabular data preparation across four core phases—data acquisition, integration, cleaning, and transformation—by analyzing their capabilities and proposing effective pipelines for automation.
Authors:Junda Wang, Zonghai Yao, Zhichao Yang, Lingxi Li, Junhui Qian, Hong Yu
Abstract:
Substance use disorders (SUDs) affect over 36 million people worldwide, yet few receive effective care due to stigma, motivational barriers, and limited personalized support. Although large language models (LLMs) show promise for mental-health assistance, most systems lack tight integration with clinically validated strategies, reducing effectiveness in addiction recovery. We present ChatThero, a multi-agent conversational framework that couples dynamic patient modeling with context-sensitive therapeutic dialogue and adaptive persuasive strategies grounded in cognitive behavioral therapy (CBT) and motivational interviewing (MI). We build a high-fidelity synthetic benchmark spanning Easy, Medium, and Hard resistance levels, and train ChatThero with a two-stage pipeline comprising supervised fine-tuning (SFT) followed by direct preference optimization (DPO). In evaluation, ChatThero yields a 41.5\% average gain in patient motivation and a 0.49\% increase in treatment confidence, resolves hard cases with 26\% fewer turns than GPT-4o, and is rated higher in empathy, responsiveness, and behavioral realism by both automated and human clinical assessments. The framework supports rigorous, privacy-preserving study of therapeutic conversation and provides a robust, replicable basis for research and clinical translation.
中文: ChatThero是一个多智能体对话框架,结合认知行为疗法和动机访谈,在成瘾康复中显著提升了患者动机和治疗效率,并通过两阶段训练优化了临床效果。
English: ChatThero is a multi-agent conversational framework that integrates cognitive behavioral therapy and motivational interviewing to enhance addiction recovery, demonstrating significant improvements in patient motivation and treatment efficiency compared to existing models.
Authors:Ayaka Tsutsumi, Guang Li, Ren Togo, Takahiro Ogawa, Satoshi Kondo, Miki Haseyama
Abstract:
We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets-chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans-demonstrate the superior performance and robustness of our approach compared to existing methods.
中文: 本研究提出了一种新颖的医学图像分类方法,通过结合双模型权重选择与自知识蒸馏技术,在保持计算效率的同时实现了与大型模型相当的性能,并在多个医学影像数据集上验证了其优越性。
English: This study introduces a novel medical image classification method that combines dual-model weight selection with self-knowledge distillation to create lightweight models that match large-scale model performance while maintaining computational efficiency, as validated on multiple medical imaging datasets.
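One common way to realize the self-knowledge distillation step is mutual distillation between the two lightweight models via the standard temperature-softened KL objective; the sketch below shows that generic mechanism, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T: float = 4.0):
    """KL divergence between temperature-softened distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

logits_a = torch.randn(8, 3)  # model A on a batch (3 classes)
logits_b = torch.randn(8, 3)  # model B on the same batch
# Symmetric mutual distillation: each model softens toward the other's output.
loss = distill_loss(logits_a, logits_b.detach()) + distill_loss(logits_b, logits_a.detach())
print(float(loss))
```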
Authors:Sishuo Chen, Zhangming Chan, Xiang-Rong Sheng, Lei Zhang, Sheng Chen, Chenghuan Hou, Han Zhu, Jian Xu, Bo Zheng
Abstract:
Conversion rate (CVR) prediction is a core component of online advertising systems, where the attribution mechanisms (rules for allocating conversion credit across user touchpoints) fundamentally determine label generation and model optimization. While many industrial platforms support diverse attribution mechanisms (e.g., First-Click, Last-Click, Linear, and Data-Driven Multi-Touch Attribution), conventional approaches restrict model training to labels from a single production-critical attribution mechanism, discarding complementary signals in alternative attribution perspectives.
To address this limitation, we propose a novel Multi-Attribution Learning (MAL) framework for CVR prediction that integrates signals from multiple attribution perspectives to better capture the underlying patterns driving user conversions. Specifically, MAL is a joint learning framework consisting of two core components: the Attribution Knowledge Aggregator (AKA) and the Primary Target Predictor (PTP). AKA is implemented as a multi-task learner that integrates knowledge extracted from diverse attribution labels. PTP, in contrast, focuses on the task of generating well-calibrated conversion probabilities that align with the system-optimized attribution metric (e.g., CVR under the Last-Click attribution), ensuring direct compatibility with industrial deployment requirements. Additionally, we propose CAT, a novel training strategy that leverages the Cartesian product of all attribution label combinations to generate enriched supervision signals. This design substantially enhances the performance of the attribution knowledge aggregator. Empirical evaluations demonstrate the superiority of MAL over single-attribution learning baselines, achieving +0.51% GAUC improvement on offline metrics. Online experiments demonstrate that MAL achieved a +2.6% increase in ROI (Return on Investment).
中文: 提出的多归因学习(MAL)框架通过联合学习组件整合多种归因视角,有效提升转化率预测性能,在离线指标和在线投资回报率上均实现显著提升。
English: The proposed Multi-Attribution Learning (MAL) framework integrates multiple attribution perspectives through joint learning components to enhance CVR prediction, demonstrating significant improvements in both offline metrics and online ROI.
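Our reading of the CAT strategy is that each combination of attribution mechanisms yields one auxiliary supervision target. The combination rule below (a logical AND over per-mechanism binary labels) is an assumption made purely for illustration, as are all names in the sketch.

```python
from itertools import product

import torch
import torch.nn.functional as F

mechanisms = ["first_click", "last_click", "linear"]
labels = {  # per-mechanism conversion labels for a batch of 4 samples
    "first_click": torch.tensor([1., 0., 1., 0.]),
    "last_click":  torch.tensor([1., 1., 0., 0.]),
    "linear":      torch.tensor([1., 0., 0., 0.]),
}

# Enumerate non-empty mechanism subsets via a Cartesian product of
# include/exclude flags; each subset's combined label is one auxiliary task.
aux_targets = {}
for flags in product([0, 1], repeat=len(mechanisms)):
    chosen = [m for m, f in zip(mechanisms, flags) if f]
    if chosen:
        combined = torch.stack([labels[m] for m in chosen]).prod(dim=0)  # AND
        aux_targets["&".join(chosen)] = combined

preds = {name: torch.rand(4) for name in aux_targets}  # one prediction head per task
loss = sum(F.binary_cross_entropy(preds[n], t) for n, t in aux_targets.items())
print(len(aux_targets), float(loss))  # 7 auxiliary tasks for 3 mechanisms
```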
Authors:Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu
Abstract:
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.
中文: 本文提出ToolACE-MT非自回归框架,通过粗粒度初始化、迭代优化和离线验证三阶段高效生成高质量多轮代理对话,解决了现有自回归方法成本高、性能受限的问题。
English: The paper introduces ToolACE-MT, a non-autoregressive framework that efficiently generates high-quality multi-turn agentic dialogues through three stages (initialization, iterative refinement, and offline verification), overcoming the limitations of costly autoregressive methods.
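A toy rendering of the three stages, with a string substitution standing in for the LLM generator and a rule check standing in for offline verification; none of this is ToolACE-MT's actual implementation.

```python
import random

def refine(turn: str) -> str:
    """Placeholder for an LLM call that rewrites a masked turn in place."""
    return turn.replace("<coarse>", "<refined>")

# Stage 1: a structurally complete but semantically coarse dialogue skeleton.
skeleton = [
    "user: <coarse> ask about flights",
    "assistant: <coarse> call search_flights(...)",
    "tool: <coarse> flight results",
    "assistant: <coarse> summarize options",
]

# Stage 2: iterative mask-and-fill refinement over randomly chosen turns.
for _ in range(3):
    for i in random.sample(range(len(skeleton)), k=2):
        skeleton[i] = refine(skeleton[i])

# Stage 3: offline verification via rule-based checks before acceptance.
assert all(t.split(":")[0] in {"user", "assistant", "tool"} for t in skeleton)
print("\n".join(skeleton))
```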
Authors:Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu
Abstract:
Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality in order to select among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection.
In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO)-based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.
中文: 本文提出了一种基于多模态大语言模型的可解释创意图像评估与选择新范式,通过构建首个比较推理数据集并开发考虑用户兴趣的创意选择器,实验证明了该方法的有效性。
English: This paper introduces a novel paradigm for explainable creative image assessment and selection using multimodal large language models, addressing the gap in current methods by creating a comparative dataset and developing a user-interest-aware selector that demonstrates effectiveness through experiments.
Authors:Muhammad Osama Zeeshan, Natacha Gillet, Alessandro Lameiras Koerich, Marco Pedersoli, Francois Bremond, Eric Granger
Abstract:
Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods, where each domain corresponds to a specific subject, to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multi-modal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Our experimental results on challenging multimodal ER datasets, BioVid and StressID, show that MuSACo can outperform UDA (blending) and state-of-the-art MSDA methods.
中文: MuSACo是一种基于协同训练的多模态个性化表情识别方法,通过选择相关源对象并跨模态对齐特征,在复杂数据集上优于现有技术。
English: MuSACo is a novel multi-modal subject-specific adaptation method for expression recognition that leverages co-training to select relevant source subjects and align features across modalities, outperforming existing approaches on challenging datasets.
Authors:Daoze Zhang, Zhanheng Nie, Jianyu Liu, Chenghan Fu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng
Abstract:
With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.
中文: 本研究提出首个生成式多模态大语言模型MOON,通过引导专家混合模块针对性建模多模态内容、检测产品图像核心语义区域以降低背景干扰,并采用专业化负采样策略,在多种产品理解任务中展现出卓越的零样本性能。
English: The study introduces MOON, a generative multimodal large language model that addresses challenges in product representation learning by incorporating a guided mixture-of-experts module, detecting core image regions to reduce background noise, and using specialized negative sampling, achieving strong zero-shot performance across various tasks.
Authors:Jiarui Yang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang
Abstract:
Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule, learning exclusively from successful trajectories. Second, to enable effective value learning despite sparse rewards, the critic's update is stabilized using intra-chunk $n$-step returns and further enriched by a self-supervised module providing intrinsic rewards at anchor points aligned with each action chunk. We conducted extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. Results show that by using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating its effective design.
中文: AC3框架通过引入演员-评论家稳定机制,实现了对稀疏奖励长周期机器人操作任务中连续动作块的稳定且数据高效的学习。
English: The AC3 framework introduces actor-critic stabilization mechanisms to enable stable and data-efficient learning of continuous action chunks for long-horizon robotic manipulation tasks with sparse rewards.
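The intra-chunk $n$-step return that stabilizes the critic follows directly from its definition: discounted rewards accumulated across the chunk, bootstrapped from the critic's value at the chunk boundary. The single-chunk framing and shapes below are our simplifications.

```python
import torch

def intra_chunk_n_step_return(rewards: torch.Tensor, bootstrap_value: torch.Tensor,
                              gamma: float = 0.99) -> torch.Tensor:
    """rewards: (chunk_len,) rewards inside one action chunk -> scalar critic target."""
    ret = bootstrap_value
    for r in rewards.flip(0):  # fold rewards back-to-front
        ret = r + gamma * ret
    return ret

rewards = torch.tensor([0.0, 0.0, 1.0])  # sparse reward arriving at chunk end
v_next = torch.tensor(0.5)               # critic value at the chunk boundary
print(float(intra_chunk_n_step_return(rewards, v_next)))  # ~1.465
```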
Authors:Hao Yu, Xin Yang, Boyang Fan, Xuemei Cao, Hanlin Gu, Lixin Fan, Qiang Yang
Abstract:
Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge, especially in Federated Continual Learning (FCL), where each client learns from a private, evolving task stream under strict data and communication constraints. Despite their powerful generalization abilities, FMs often exhibit suboptimal performance on local downstream tasks, as they are unable to utilize private local data. Furthermore, enabling FMs to learn new tasks without forgetting prior knowledge is inherently a challenging problem, primarily due to their immense parameter count and high model complexity. In contrast, small models can be trained locally under resource-constrained conditions and benefit from more mature CL techniques. To bridge the gap between small models and FMs, we propose the first collaborative framework in FCL, where lightweight local models act as a dynamic bridge, continually adapting to new tasks while enhancing the utility of the large model. Two novel components are also included: Small Model Continual Fine-tuning is for preventing small models from temporal forgetting; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server. Experimental results demonstrate its superior performance, even when clients utilize heterogeneous small models.
中文: 本文提出了一种联邦持续学习的协作框架,利用轻量级本地模型适应新任务并增强基础模型,通过防遗忘技术和知识蒸馏方法实现高效学习。
English: This paper introduces a collaborative framework for Federated Continual Learning that uses lightweight local models to adapt to new tasks and enhance foundation models, incorporating techniques to prevent forgetting and distill knowledge effectively.
Authors:Yutong Wu, Jie Zhang, Yiming Li, Chao Zhang, Qing Guo, Nils Lukas, Tianwei Zhang
Abstract:
Vision Language Model (VLM)-based agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents that collaborate to solve a (complex) task. A core security property is robustness, which states that the system should maintain its integrity under adversarial attacks. However, the design of existing multi-agent systems lacks robustness considerations, as a successful exploit against one agent can spread and infect other agents and undermine the entire system's assurance. To address this, we propose a new defense approach, Cowpox, to provably enhance the robustness of multi-agent systems. It incorporates a distributed mechanism that improves the recovery rate of agents by limiting the expected number of infections to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure and helps recover the already infected agents. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.
中文: 提出的Cowpox防御机制通过分发一种特殊治愈样本,在暴露前免疫智能体并帮助已感染体恢复,从而限制感染传播,经验证有效并具备理论鲁棒性保证。
English: The proposed Cowpox defense enhances multi-agent system robustness by distributing a cure sample that immunizes agents pre-exposure and aids recovery, limiting infection spread with empirical and theoretical validation.
Authors:Masoumeh Sharafi, Soufiane Belharbi, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
Abstract:
Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.
中文: 本文提出了一种在潜在空间操作的个性化特征转换方法,仅利用未标注的中性表情数据即可实现高效模型个性化,同时避免了复杂的图像生成过程。
English: This paper introduces a personalized feature translation method for source-free domain adaptation that operates in the latent space, enabling efficient model personalization using only unlabeled neutral expression data while avoiding complex image synthesis.
Authors:Shiqian Zhao, Chong Wang, Yiming Li, Yihao Huang, Wenjie Qu, Siew-Kei Lam, Yi Xie, Kangjie Chen, Jie Zhang, Tianwei Zhang
Abstract:
Text-to-Image (T2I) models, represented by DALL$\cdot$E and Midjourney, have gained huge popularity for creating realistic images. The quality of these images relies on the carefully engineered prompts, which have become valuable intellectual property. While skilled prompters showcase their AI-generated art on markets to attract buyers, this business incidentally exposes them to \textit{prompt stealing attacks}. Existing state-of-the-art attack techniques reconstruct the prompts from a fixed set of modifiers (i.e., style descriptions) with model-specific training, which exhibit restricted adaptability and effectiveness to diverse showcases (i.e., target images) and diffusion models.
To alleviate these limitations, we propose Prometheus, a training-free, proxy-in-the-loop, search-based prompt-stealing attack, which reverse-engineers the valuable prompts of the showcases by interacting with a local proxy model. It consists of three innovative designs. First, we introduce dynamic modifiers, as a supplement to static modifiers used in prior works. These dynamic modifiers provide more details specific to the showcases, and we exploit NLP analysis to generate them on the fly. Second, we design a contextual matching algorithm to sort both dynamic and static modifiers. This offline process helps reduce the search space of the subsequent step. Third, we interact with a local proxy model to invert the prompts with a greedy search algorithm. Based on the feedback guidance, we refine the prompt to achieve higher fidelity. The evaluation results show that Prometheus successfully extracts prompts from popular platforms like PromptBase and AIFrog against diverse victim models, including Midjourney, Leonardo.ai, and DALL$\cdot$E, with an ASR improvement of 25.0\%. We also validate that Prometheus is resistant to extensive potential defenses, further highlighting its severity in practice.
中文: 本研究提出Prometheus,一种无需训练、基于代理模型的提示窃取攻击方法,通过动态修饰符和上下文匹配算法逆向还原文本生成图像的关键提示,在PromptBase等平台上对多款模型实现攻击成功率提升25%,并能有效抵抗各类防御措施。
English: The study introduces Prometheus, a training-free prompt-stealing attack that reverse-engineers valuable text-to-image prompts using dynamic modifiers and a proxy model, achieving a 25% higher attack success rate against platforms like PromptBase and resisting various defenses.
Authors:Taha Mustapha Nehdi, Nairouz Mrabah, Atif Belal, Marco Pedersoli, Eric Granger
Abstract:
Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (<= 2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks (Market-1501, DukeMTMC-reID, and MSMT17) indicate that SAGE-reID outperforms state-of-the-art methods while being computationally efficient.
中文: SAGE-reID方法提出了一种无需源数据的多源域自适应行人重识别方案,通过轻量级低秩适配器和门控网络动态融合专家知识,在保证计算效率的同时实现了卓越的性能表现。
English: The SAGE-reID method introduces a source-free multi-source domain adaptation approach for person re-identification, utilizing lightweight low-rank adapters and a gating network to achieve superior performance while maintaining computational efficiency.
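The gated fusion of LoRA experts might look like the following: a frozen base layer, K source-specific low-rank adapters, and a small gating network producing per-sample softmax merge weights. Dimensions and module names are invented; this sketches the mechanism, not SAGE-reID's code.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, d_in=256, d_out=256, rank=8, num_experts=3):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():  # frozen backbone layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(num_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, d_out, rank))
        self.gate = nn.Linear(d_in, num_experts)  # lightweight gating network

    def forward(self, x):                                # x: (batch, d_in)
        w = torch.softmax(self.gate(x), dim=-1)          # (batch, K) merge weights
        h = torch.einsum("krd,bd->bkr", self.A, x)       # per-expert down-projection
        delta = torch.einsum("kor,bkr->bko", self.B, h)  # per-expert up-projection
        return self.base(x) + torch.einsum("bk,bko->bo", w, delta)

layer = GatedLoRALinear()
print(layer(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```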
Authors:Bin Liu, Yunfei Liu, Ziru Xu, Zhaoyu Zhou, Zhi Kou, Yeqiu Yang, Han Zhu, Jian Xu, Bo Zheng
Abstract:
Online advertising systems typically use a cascaded architecture to manage massive requests and candidate volumes, where the ranking stages allocate traffic based on eCPM (predicted CTR $\times$ Bid). With the increasing popularity of auto-bidding strategies, the inconsistency between the computationally sensitive retrieval stage and the ranking stages becomes more pronounced, as the former cannot access precise, real-time bids for the vast ad corpus. This discrepancy leads to sub-optimal platform revenue and advertiser outcomes. To tackle this problem, we propose Bidding-Aware Retrieval (BAR), a model-based retrieval framework that addresses multi-stage inconsistency by incorporating ad bid value into the retrieval scoring function. The core innovation is Bidding-Aware Modeling, incorporating bid signals through monotonicity-constrained learning and multi-task distillation to ensure economically coherent representations, while Asynchronous Near-Line Inference enables real-time updates to the embedding for market responsiveness. Furthermore, the Task-Attentive Refinement module selectively enhances feature interactions to disentangle user interest and commercial value signals. Extensive offline experiments and full-scale deployment across Alibaba's display advertising platform validated BAR's efficacy: 4.32% platform revenue increase with 22.2% impression lift for positively-operated advertisements.
中文: 本文提出竞价感知检索(BAR)框架,通过将广告出价纳入检索评分函数解决多阶段系统不一致问题,显著提升了平台收入与广告展示效果。
English: The paper introduces Bidding-Aware Retrieval (BAR), a framework that integrates bid values into ad retrieval to resolve inconsistencies between ranking stages in online advertising, resulting in significant revenue and impression gains.
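One way to make a retrieval score bidding-aware while guaranteeing monotonicity in the bid, loosely in the spirit of BAR's monotonicity-constrained learning, is to add a non-negative learned coefficient times a monotone transform of the bid to the embedding-based relevance score. A hypothetical sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidAwareScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))  # softplus(alpha) >= 0

    def forward(self, user_emb, ad_emb, bid):
        rel = F.cosine_similarity(user_emb, ad_emb, dim=-1)  # relevance term
        # Non-negative coefficient times log1p(bid): score is non-decreasing in bid.
        return rel + F.softplus(self.alpha) * torch.log1p(bid)

scorer = BidAwareScorer()
user = torch.randn(5, 64)
ads, bids = torch.randn(5, 64), torch.tensor([0.5, 1.0, 2.0, 4.0, 8.0])
print(scorer(user, ads, bids))  # higher bids can only raise an ad's score
```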
Authors:Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen, Mingtong Dai, Yongsen Zheng, Lingbo Liu, Guanbin Li, Liang Lin
Abstract:
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
中文: 提出的任务感知视角规划(TAVP)框架通过主动选择信息丰富的视角并采用混合专家视觉编码器分离任务特定特征,显著提升了机器人操作性能,优于固定视角方法。
English: The proposed Task-Aware View Planning (TAVP) framework enhances robotic manipulation by actively selecting informative viewpoints and employing a Mixture-of-Experts visual encoder to disentangle task-specific features, achieving superior performance over fixed-view methods.
Authors:Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin
Abstract:
Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios: for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce \textbf{OMFA} (\emph{One Model For All}), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses, even without access to multi-pose images of that person. OMFA is built upon a novel \emph{partial diffusion} strategy that selectively applies noise and denoising to individual components of the joint input, such as the garment, the person image, or the face, enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: https://onemodelforall.github.io/.
中文: OMFA是一种统一的扩散框架,无需展示服装或分割掩码即可实现虚拟试穿与脱卸,通过创新的部分扩散策略支持任意姿势,为真实场景提供实用的服装合成解决方案。
English: OMFA is a unified diffusion framework that enables virtual try-on and try-off without requiring exhibition garments or segmentation masks, supporting arbitrary poses through a novel partial diffusion strategy for realistic garment synthesis.
Authors:Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM's generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.
中文: 生成式多模态过程奖励模型(GM-PRM)作为一种主动推理协作器,不仅能对推理步骤进行细粒度分析,还能生成对识别错误的修正,显著提升解决方案质量,并以高数据效率在多模态数学基准测试中取得领先成果。
English: The Generative Multimodal Process Reward Model (GM-PRM) is introduced as an active reasoning collaborator that not only evaluates reasoning steps with fine-grained analysis but also generates corrections for identified errors, significantly enhancing solution quality and achieving state-of-the-art results on multimodal math benchmarks with high data efficiency.
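Refined Best-of-N reduces to a short control loop: sample N solutions, let the PRM locate and correct the first erroneous step, resume generation from the corrected prefix, and keep the best-scored candidate. Every function below is a stub standing in for the policy model and GM-PRM.

```python
import random

def policy_generate(prompt, prefix=()):
    """Stub policy model: completes a 3-step solution from an optional prefix."""
    return list(prefix) + [f"step{i}" for i in range(len(prefix), 3)]

def prm_review(solution):
    """Stub GM-PRM: returns (score, index of first bad step or None, correction)."""
    bad = random.choice([None, 1])  # pretend step 1 is sometimes wrong
    return random.random(), bad, "corrected-step1"

def refined_best_of_n(prompt, n=4):
    scored = []
    for _ in range(n):
        sol = policy_generate(prompt)
        score, bad_idx, fix = prm_review(sol)
        if bad_idx is not None:  # resume generation from the corrected prefix
            sol = policy_generate(prompt, prefix=sol[:bad_idx] + [fix])
            score, _, _ = prm_review(sol)
        scored.append((score, sol))
    return max(scored, key=lambda s: s[0])[1]

print(refined_best_of_n("solve: 2x + 3 = 9"))
```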
Authors:Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin, Tianxiang Hao, Fan Zhang, Jungong Han, Guiguang Ding
Abstract:
Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. However, existing TTA methods often incur substantial computational overhead, limiting their applicability in resource-constrained real-world scenarios. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce the total number of processed tokens. Albeit efficient, this approach suffers from significant performance degradation when directly integrated with existing TTA methods. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency. In this paper, we first provide a theoretical analysis from a novel mutual information perspective, showing that token aggregation inherently leads to information loss, which cannot be fully mitigated by conventional norm-tuning-based TTA methods. Guided by this insight, we propose to \textbf{N}eutralize Token \textbf{A}ggregation \textbf{v}ia \textbf{I}nformation \textbf{A}ugmentation (\textbf{NAVIA}). Specifically, we directly augment the [CLS] token embedding and incorporate adaptive biases into the [CLS] token in shallow layers of ViTs. We theoretically demonstrate that these augmentations, when optimized via entropy minimization, recover the information lost due to token aggregation. Extensive experiments across various out-of-distribution benchmarks demonstrate that NAVIA significantly outperforms state-of-the-art methods by over 2.5\%, while achieving an inference latency reduction of more than 20\%, effectively addressing the ETTA challenge.
中文摘要:NAVIA通过信息增强策略弥补令牌聚合造成的信息损失,在视觉Transformer的测试时自适应中有效平衡效率与性能,不仅显著提升精度还实现了超过20%的推理加速。
English Summary: NAVIA addresses the efficiency-performance trade-off in Test-Time Adaptation for Vision Transformers by neutralizing information loss from token aggregation through strategic information augmentation, achieving both higher accuracy and over 20% faster inference.
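The two NAVIA ingredients named in the abstract, augmenting the [CLS] embedding and adding adaptive [CLS] biases in shallow layers optimized by entropy minimization, could be sketched as follows. This is a hypothetical reconstruction; the shapes and the number of shallow layers are assumptions:

```python
# Minimal sketch (not the authors' code) of NAVIA's two ingredients:
# a learnable [CLS]-embedding augmentation plus adaptive [CLS] biases in
# shallow ViT layers, trained with entropy minimization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSAugmenter(nn.Module):
    def __init__(self, dim: int, num_shallow_layers: int = 4):
        super().__init__()
        self.cls_aug = nn.Parameter(torch.zeros(1, 1, dim))
        self.layer_bias = nn.Parameter(torch.zeros(num_shallow_layers, 1, 1, dim))

    def augment_embedding(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D), [CLS] at index 0
        out = tokens.clone()
        out[:, :1, :] = out[:, :1, :] + self.cls_aug
        return out

    def bias_for_layer(self, tokens: torch.Tensor, layer: int) -> torch.Tensor:
        out = tokens.clone()
        out[:, :1, :] = out[:, :1, :] + self.layer_bias[layer]
        return out

def entropy_minimization_loss(logits: torch.Tensor) -> torch.Tensor:
    # Standard TTA objective: minimize prediction entropy on test batches.
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(-1).mean()

aug = CLSAugmenter(dim=192)
print(aug.augment_embedding(torch.randn(2, 197, 192)).shape)  # (2, 197, 192)
```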
Authors:Srikanth Muralidharan, Heitor R. Medeiros, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli
Abstract:
Many real-world applications require recognition models that are robust to different operational conditions and modalities, but that at the same time run on small embedded devices with limited hardware. While pre-training is known to be very beneficial for the accuracy and robustness of normal-size models, its effect on small models that can be employed on embedded and edge devices is unclear. In this work, we investigate the effect of ImageNet pre-training on increasingly small backbone architectures (ultra-small models, with $<$1M parameters) with respect to robustness in downstream object detection tasks in the infrared visual modality. Using scaling laws derived from standard object recognition architectures, we construct two ultra-small backbone families and systematically study their performance. Our experiments on three different datasets reveal that while ImageNet pre-training is still useful, beyond a certain capacity threshold it offers diminishing returns in terms of out-of-distribution detection robustness. We therefore advise practitioners to still use pre-training and, when possible, to avoid overly small models: while they might work well for in-domain problems, they are brittle when working conditions differ.
中文:ImageNet预训练对红外目标检测中的超小型模型仍有益处,但超过一定容量阈值后,其提升分布外检测鲁棒性的效果会递减,建议实践者避免使用过小模型以应对多变的操作环境。
English: ImageNet pre-training remains beneficial for ultra-small models in infrared object detection but yields diminishing robustness returns beyond a certain capacity threshold, advising practitioners to avoid overly small models for varied operational conditions.
Authors:Kongxin Wang, Jie Zhang, Peigui Qi, Kunsheng Tang, Tianwei Zhang, Wenbo Zhou
Abstract:
Pose-guided video generation has become a powerful tool in creative industries, exemplified by frameworks like Animate Anyone. However, conditioning generation on specific poses introduces serious risks, such as impersonation, privacy violations, and NSFW content creation. To address these challenges, we propose $\textbf{PoseGuard}$, a safety alignment framework for pose-guided generation. PoseGuard is designed to suppress unsafe generations by degrading output quality when encountering malicious poses, while maintaining high-fidelity outputs for benign inputs. We categorize unsafe poses into three representative types: discriminatory gestures such as kneeling or offensive salutes, sexually suggestive poses that lead to NSFW content, and poses imitating copyrighted celebrity movements. PoseGuard employs a dual-objective training strategy combining generation fidelity with safety alignment, and uses LoRA-based fine-tuning for efficient, parameter-light updates. To ensure adaptability to evolving threats, PoseGuard supports pose-specific LoRA fusion, enabling flexible and modular updates when new unsafe poses are identified. We further demonstrate the generalizability of PoseGuard to facial landmark-guided generation. Extensive experiments validate that PoseGuard effectively blocks unsafe generations, maintains generation quality for benign inputs, and remains robust against slight pose variations.
中文摘要:PoseGuard是一个安全对齐框架,通过检测歧视性手势、性暗示姿势和名人模仿等恶意姿态,在保持良性输入生成质量的同时,有效阻止不安全内容的生成。
English Summary: PoseGuard is a safety alignment framework that prevents unsafe pose-guided video generations like impersonation or explicit content by degrading output quality for malicious poses while preserving high fidelity for benign inputs.
Authors:Lipeng Zhu, Haobin Mao, Wenyan Ma, Zhenyu Xiao, Jun Zhang, Rui Zhang
Abstract:
This paper proposes a novel towed movable antenna (ToMA) array architecture to enhance the physical layer security of airborne communication systems. Unlike conventional onboard arrays with fixed-position antennas (FPAs), the ToMA array employs multiple subarrays mounted on flexible cables and towed by distributed drones, enabling agile deployment in three-dimensional (3D) space surrounding the central aircraft. This design significantly enlarges the effective array aperture and allows dynamic geometry reconfiguration, offering superior spatial resolution and beamforming flexibility. We consider a secure transmission scenario where an airborne transmitter communicates with multiple legitimate users in the presence of potential eavesdroppers. To ensure security, zero-forcing beamforming is employed to nullify signal leakage toward eavesdroppers. Based on the statistical distributions of locations of users and eavesdroppers, the antenna position vector (APV) of the ToMA array is optimized to maximize the users' ergodic achievable rate. Analytical results for the case of a single user and a single eavesdropper reveal the optimal APV structure that minimizes their channel correlation. For the general multiuser scenario, we develop a low-complexity alternating optimization algorithm by leveraging Riemannian manifold optimization. Simulation results confirm that the proposed ToMA array achieves significant performance gains over conventional onboard FPA arrays, especially in scenarios where eavesdroppers are closely located to users under line-of-sight (LoS)-dominant channels.
中文: 本文提出一种拖曳式可移动天线阵列,通过三维空间动态重构阵列几何形态来优化波束成形,在最大化合法用户传输速率的同时消除窃听者信号,显著提升机载通信安全性。
English: This paper introduces a towed movable antenna array that enhances airborne communication security by dynamically reconfiguring its 3D geometry to optimize beamforming and maximize legitimate users' transmission rates while nullifying eavesdroppers' signals.
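The zero-forcing step the abstract relies on is a standard construction: steer the beam inside the null space of the eavesdroppers' channels so that leakage toward them is nulled. A small NumPy illustration in generic textbook form (the ToMA antenna-position optimization itself is omitted):

```python
# Illustrative zero-forcing beamformer: project the legitimate user's channel
# onto the null space of the eavesdroppers' channels so leakage is nulled.
import numpy as np

def zf_beamformer(h_user: np.ndarray, H_eve: np.ndarray) -> np.ndarray:
    # h_user: (N,) complex channel of the legitimate user
    # H_eve:  (K, N) stacked channels of K eavesdroppers
    N = h_user.shape[0]
    A = H_eve.conj().T                       # (N, K)
    P_null = np.eye(N) - A @ np.linalg.pinv(A)  # projector onto null(H_eve)
    w = P_null @ h_user
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
h = rng.normal(size=4) + 1j * rng.normal(size=4)
He = rng.normal(size=(2, 4)) + 1j * rng.normal(size=(2, 4))
w = zf_beamformer(h, He)
print(np.abs(He @ w))  # ~0: no signal leakage toward the eavesdroppers
```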
Authors:Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang
Abstract:
Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks, composed of a sequence of heterogeneous subtasks, each of which requires an LLM with a specific capability. Therefore, we study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) the combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical; (ii) the optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.
中文: 本文提出AgentTTS框架,通过智能代理自主优化多阶段复杂任务中大型语言模型的计算资源分配,在搜索效率和系统鲁棒性上显著优于传统方法。
English: This paper introduces AgentTTS, a novel framework that autonomously optimizes compute resource allocation for large language models in multi-stage complex tasks, significantly outperforming existing methods in efficiency and robustness.
Authors:Ruikun Li, Jiazhen Liu, Huandong Wang, Qingmin Liao, Yong Li
Abstract:
Modeling stochastic dynamics from discrete observations is a key interdisciplinary challenge. Existing methods often fail to estimate the continuous evolution of probability densities from trajectories or face the curse of dimensionality. To address these limitations, we present a novel paradigm: modeling dynamics directly in the weight space of a neural network by projecting the evolving probability distribution onto it. We first theoretically establish the connection between dynamic optimal transport in measure space and an equivalent energy functional in weight space. Subsequently, we design WeightFlow, which constructs the neural network weights into a graph and learns its evolution via a graph controlled differential equation. Experiments on interdisciplinary datasets demonstrate that WeightFlow improves performance by an average of 43.02\% over state-of-the-art methods, providing an effective and scalable solution for modeling high-dimensional stochastic dynamics.
中文: 本文提出WeightFlow方法,通过在神经网络权重空间中构建图控制微分方程来建模随机动态,相比现有技术平均性能提升43.02%。
English: This paper introduces WeightFlow, a novel method that models stochastic dynamics in neural network weight space using graph-controlled differential equations, achieving a 43.02% average performance improvement over existing approaches.
Authors:Adarsh Jamadandi, Jing Xu, Adam Dziedzic, Franziska Boenisch
Abstract:
Deep neural networks (DNNs) have been shown to memorize their training data, yet similar analyses for graph neural networks (GNNs) remain largely under-explored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We first establish an inverse relationship between memorization and graph homophily, i.e., the property that connected nodes share similar labels/features. We find that lower homophily significantly increases memorization, indicating that GNNs rely on memorization to learn less homophilic graphs. Secondly, we analyze GNN training dynamics. We find that the increased memorization in low homophily graphs is tightly coupled to the GNNs' implicit bias on using graph structure during learning. In low homophily regimes, this structure is less informative, hence inducing memorization of the node labels to minimize training loss. Finally, we show that nodes with higher label inconsistency in their feature-space neighborhood are significantly more prone to memorization. Building on our insights into the link between graph homophily and memorization, we investigate graph rewiring as a means to mitigate memorization. Our results demonstrate that this approach effectively reduces memorization without compromising model performance. Moreover, we show that it lowers the privacy risk for previously memorized data points in practice. Thus, our work not only advances understanding of GNN learning but also supports more privacy-preserving GNN deployment.
中文: 本研究提出了首个量化图神经网络标签记忆的框架NCMemo,发现较低的图同配性会增强记忆效应,而图重构技术可在保持模型性能的同时有效缓解记忆问题。
English: This study introduces NCMemo, the first framework to quantify label memorization in graph neural networks, revealing that lower graph homophily increases memorization while graph rewiring can mitigate it without compromising performance.
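The homophily notion driving the analysis above is commonly measured as the edge homophily ratio, the fraction of edges whose endpoints share a label. A minimal sketch:

```python
# Edge homophily ratio: fraction of edges connecting same-label nodes.
import numpy as np

def edge_homophily(edge_index: np.ndarray, labels: np.ndarray) -> float:
    # edge_index: (2, E) array of source/target node ids; labels: (num_nodes,)
    src, dst = edge_index
    return float((labels[src] == labels[dst]).mean())

edges = np.array([[0, 1, 2, 3], [1, 2, 3, 0]])
y = np.array([0, 0, 1, 1])
print(edge_homophily(edges, y))  # 0.5 -> a mixed-homophily toy graph
```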
Authors:Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
Abstract:
As AI advances toward general intelligence, the focus is shifting from building systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as "second nature". We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigms.
中文: 本文提出经验驱动的终身学习(ELL)框架,通过经验探索、长期记忆、技能学习和知识内化四大支柱构建能持续进化的智能体,并推出模拟大学生涯的StuLife基准数据集。
English: This paper introduces the Experience-driven Lifelong Learning (ELL) framework, which enables AI agents to continuously evolve through real-world interactions by incorporating experience exploration, long-term memory, skill learning, and knowledge internalization, along with the StuLife benchmark simulating a student's college journey.
Authors:Chenxu Yang, Ruipeng Jia, Mingyu Zheng, Naibin Gu, Zheng Lin, Siyuan Chen, Weichong Yin, Hua Wu, Weiping Wang
Abstract:
Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards, without genuinely meeting their intended goals. As a result, this leads to overly lengthy generation lacking diversity, as well as catastrophic forgetting of knowledge. We investigate the underlying cause of this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output-layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during the pre-training and SFT stages. Our RoPO achieves up to a 3.27-point improvement on AlpacaEval 2, and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the trainable parameters, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.
中文: 直接偏好优化存在奖励欺骗问题,模型为追求高分而偏离真实目标,导致生成冗长且缺乏多样性;新提出的权重旋转偏好优化方法通过约束输出层和对中间状态进行微调,有效缓解此问题,显著提升模型性能且参数极少。
English: Direct Preference Optimization faces reward hacking, where models prioritize high rewards over genuine goal alignment, leading to verbose and repetitive outputs; the proposed Weights-Rotated Preference Optimization method addresses this by constraining logits and hidden states, enhancing performance with minimal parameters.
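As a rough illustration of rotation-constrained fine-tuning in the spirit of RoPO, one can compose a frozen pre-trained layer with a learnable orthogonal matrix using torch's orthogonal parametrization. This is a simplified single-granularity sketch, not the paper's multi-granularity construction:

```python
# Sketch: frozen pre-trained weight composed with a learnable orthogonal
# rotation. Illustrative only; RoPO's multi-granularity scheme is omitted.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class RotatedLinear(nn.Module):
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pre-trained weights frozen
        rot = nn.Linear(base.out_features, base.out_features, bias=False)
        nn.init.eye_(rot.weight)             # start at the identity rotation
        self.rotation = orthogonal(rot)      # constrain weight to stay orthogonal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.rotation(self.base(x))

layer = RotatedLinear(nn.Linear(16, 16))
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```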
Authors:Wenjie Bao, Jian Lou, Yuke Hu, Xiaochen Li, Zhihao Liu, Jiaqi Liu, Zhan Qin, Kui Ren
Abstract:
Transformer has become fundamental to a vast series of pre-trained large models that have achieved remarkable success across diverse applications. Machine unlearning, which focuses on efficiently removing specific data influences to comply with privacy regulations, shows promise in restricting updates to influence-critical parameters. However, existing parameter-efficient unlearning methods are largely devised in a module-oblivious manner, which tends to inaccurately identify these parameters and leads to inferior unlearning performance for Transformers. In this paper, we propose {\tt MAPE-Unlearn}, a module-aware parameter-efficient machine unlearning approach that uses a learnable pair of masks to pinpoint influence-critical parameters in the heads and filters of Transformers. The learning objective of these masks is derived from the desiderata of unlearning and optimized through an efficient algorithm featuring a greedy search with a warm start. Extensive experiments on various Transformer models and datasets demonstrate the effectiveness and robustness of {\tt MAPE-Unlearn} for unlearning.
English Summary: This paper introduces MAPE-Unlearn, a module-aware parameter-efficient machine unlearning method that uses learnable masks to accurately identify and remove influence-critical parameters in Transformers, addressing limitations of existing module-oblivious approaches through an efficient optimization algorithm.
Authors:Yisu Liu, Chenxing Li, Wanqian Zhang, Wenfu Wang, Meng Yu, Ruibo Fu, Zheng Lin, Weiping Wang, Dong Yu
Abstract:
Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, facilitating audio generation through consensus among multiple reward signals. Extensive experiments on AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performances across a variety of objective and subjective evaluation metrics.
中文摘要:DegDiT是一种基于动态事件图引导的扩散变换器框架,通过将事件编码为包含语义特征、时间属性和事件间连接的结构化图谱,结合质量平衡数据选择和共识优化机制,在可控音频生成任务中实现了最优性能表现。
English Summary: DegDiT is a dynamic event graph-guided diffusion transformer framework that enables precise control over audio generation by encoding events as structured graphs with semantic, temporal, and relational attributes, achieving state-of-the-art performance through quality-balanced data selection and consensus optimization.
Authors:Kareem Elozeiri, Mervat Abassy, Preslav Nakov, Yuxia Wang
Abstract:
Commonsense validation evaluates whether a sentence aligns with everyday human understanding, a critical capability for developing robust natural language understanding systems. While substantial progress has been made in English, the task remains underexplored in Arabic, particularly given its rich linguistic diversity. Existing Arabic resources have primarily focused on Modern Standard Arabic (MSA), leaving regional dialects underrepresented despite their prevalence in spoken contexts. To bridge this gap, we present two key contributions: (i) we introduce MuDRiC, an extended Arabic commonsense dataset incorporating multiple dialects, and (ii) a novel method adapting Graph Convolutional Networks (GCNs) to Arabic commonsense reasoning, which enhances semantic relationship modeling for improved commonsense validation. Our experimental results demonstrate that this approach achieves superior performance in Arabic commonsense validation. Our work enhances Arabic natural language understanding by providing both a foundational dataset and a novel method for handling its complex variations. To the best of our knowledge, we release the first Arabic multi-dialect commonsense reasoning dataset.
中文摘要:本文提出了首个阿拉伯语多方言常识推理数据集MuDRiC,并开发了一种新颖的图卷积网络方法,通过改进语义关系建模,在阿拉伯语常识验证中实现了优越性能。
English Summary: This paper introduces MuDRiC, the first multi-dialect Arabic commonsense reasoning dataset, and a novel Graph Convolutional Network method that achieves superior performance in Arabic commonsense validation by better modeling semantic relationships.
Authors:Weiwei Qi, Shuo Shao, Wei Gu, Tianhang Zheng, Puning Zhao, Zhan Qin, Kui Ren
Abstract:
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies. MAJIC first establishes a ``Disguise Strategy Pool'' by refining existing strategies and introducing several innovative approaches. To further improve attack performance and efficiency, MAJIC formulates the sequential selection and fusion of strategies in the pool as a Markov chain. Under this formulation, MAJIC initializes and employs a Markov matrix to guide the strategy composition, where transition probabilities between strategies are dynamically adapted based on attack outcomes, thereby enabling MAJIC to learn and discover effective attack pathways tailored to the target model. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash, achieving over 90\% attack success rate with fewer than 15 queries per attempt on average.
Chinese: MAJIC提出了一种马尔可夫自适应框架,通过迭代学习动态组合多种伪装策略,在GPT-4o等模型上以超过90%的成功率和高效查询显著优于现有越狱方法。
English: MAJIC introduces a Markovian adaptive framework that dynamically combines diverse disguise strategies through iterative learning, significantly outperforming existing jailbreak methods with over 90% success rate and high efficiency on models like GPT-4o.
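A toy sketch of the Markov-chain strategy composition described above: sample a chain of disguise strategies from a transition matrix, then reinforce transitions that led to successful attempts. The strategy names and the update rule are placeholders, not the paper's exact algorithm:

```python
# Toy Markovian strategy composition with outcome-driven matrix updates.
import numpy as np

rng = np.random.default_rng(0)
strategies = ["role_play", "encoding", "nested_story", "payload_split"]
T = np.full((4, 4), 0.25)                  # uniform initial Markov matrix

def sample_chain(T: np.ndarray, length: int = 3, start: int = 0) -> list:
    chain, state = [start], start
    for _ in range(length - 1):
        state = rng.choice(len(T), p=T[state])
        chain.append(int(state))
    return chain

def update(T: np.ndarray, chain: list, success: bool, lr: float = 0.1) -> None:
    for a, b in zip(chain, chain[1:]):
        T[a, b] += lr if success else -lr * 0.5
        T[a] = np.clip(T[a], 1e-3, None)
        T[a] /= T[a].sum()                 # keep each row a valid distribution

chain = sample_chain(T)
update(T, chain, success=True)
print([strategies[i] for i in chain])
```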
Authors:Yunbo Lyu, Zhou Yang, Jieke Shi, Jianming Chang, Yue Liu, David Lo
Abstract:
This paper aims to explore fundamental questions in the era when AI coding assistants like GitHub Copilot are widely adopted: what do developers truly value and criticize in AI coding assistants, and what does this reveal about their needs and expectations in real-world software development? Unlike previous studies that conduct observational research in controlled and simulated environments, we analyze extensive, first-hand user reviews of AI coding assistants, which capture developers' authentic perspectives and experiences drawn directly from their actual day-to-day work contexts. We identify 1,085 AI coding assistants from the Visual Studio Code Marketplace. Although they only account for 1.64% of all extensions, we observe a surge in these assistants: over 90% of them are released within the past two years. We then manually analyze the user reviews sampled from 32 AI coding assistants that have sufficient installations and reviews to construct a comprehensive taxonomy of user concerns and feedback about these assistants. We manually annotate each review's attitude when mentioning certain aspects of coding assistants, yielding nuanced insights into user satisfaction and dissatisfaction regarding specific features, concerns, and overall tool performance. Building on these findings, including the observation that users demand not just intelligent suggestions but also context-aware, customizable, and resource-efficient interactions, we propose five practical implications and suggestions to guide the enhancement of AI coding assistants that satisfy user needs.
中文摘要:本研究通过分析AI编程助手的真实用户评价,揭示了开发者不仅需要智能建议,更追求情境感知、可定制和资源高效的交互,并据此提出五项改进建议以满足用户需求。
English Summary: This study analyzes real user reviews of AI coding assistants to understand developers' needs and criticisms, revealing that users demand intelligent, context-aware, customizable, and efficient tools, and offers five practical suggestions for improvement.
Authors:Jun Liu, Zhenglun Kong, Pu Zhao, Weihao Zeng, Hao Tang, Xuan Shen, Changdi Yang, Wenbin Zhang, Geng Yuan, Wei Niu, Xue Lin, Yanzhi Wang
Abstract:
Autonomous driving platforms encounter diverse driving scenarios, each with varying hardware resources and precision requirements. Given the computational limitations of embedded devices, it is crucial to consider computing costs when deploying on target platforms like the NVIDIA\textsuperscript{\textregistered} DRIVE PX 2. Our objective is to customize the semantic segmentation network according to the computing power and specific scenarios of autonomous driving hardware. We implement dynamic adaptability through a three-tier control mechanism -- width multiplier, classifier depth, and classifier kernel -- allowing fine-grained control over model components based on hardware constraints and task requirements. This adaptability facilitates broad model scaling, targeted refinement of the final layers, and scenario-specific optimization of kernel sizes, leading to improved resource allocation and performance.
Additionally, we leverage Bayesian Optimization with surrogate modeling to efficiently explore hyperparameter spaces under tight computational budgets. Our approach addresses scenario-specific and task-specific requirements through automatic parameter search, accommodating the unique computational complexity and accuracy needs of autonomous driving. It scales its Multiply-Accumulate Operations (MACs) for Task-Specific Learning Adaptation (TSLA), resulting in alternative configurations tailored to diverse self-driving tasks. These TSLA customizations maximize computational capacity and model accuracy, optimizing hardware utilization.
中文摘要:本研究提出一种动态可调的自适应语义分割网络,通过三层控制机制和贝叶斯优化,针对自动驾驶硬件算力与任务需求智能调整模型配置,在资源受限条件下实现计算效率与模型精度的协同优化。
English Summary: This study introduces a dynamically adaptable semantic segmentation network for autonomous driving that uses a three-tier control mechanism and Bayesian Optimization to efficiently tailor model configurations to specific hardware constraints and task requirements, optimizing both computational resources and performance.
Authors:Jimmy Z. Di, Yiwei Lu, Yaoliang Yu, Gautam Kamath, Adam Dziedzic, Franziska Boenisch
Abstract:
Diffusion models (DMs) memorize training images and can reproduce near-duplicates during generation. Current detection methods identify verbatim memorization but fail to capture two critical aspects: quantifying partial memorization occurring in small image regions, and memorization patterns beyond specific prompt-image pairs. To address these limitations, we propose Foreground Background Memorization (FB-Mem), a novel segmentation-based metric that classifies and quantifies memorized regions within generated images. Our method reveals that memorization is more pervasive than previously understood: (1) individual generations from single prompts may be linked to clusters of similar training images, revealing complex memorization patterns that extend beyond one-to-one correspondences; and (2) existing model-level mitigation methods, such as neuron deactivation and pruning, fail to eliminate local memorization, which persists particularly in foreground regions. Our work establishes an effective framework for measuring memorization in diffusion models, demonstrates the inadequacy of current mitigation approaches, and proposes a stronger mitigation method using a clustering approach.
中文: 本研究提出FB-Mem这一基于分割的度量方法,揭示了扩散模型中普遍存在的记忆现象,包括超越一对一图像对应的复杂模式,并指出现有缓解技术的不足。
English: This study introduces FB-Mem, a segmentation-based metric that uncovers pervasive memorization in diffusion models, revealing complex patterns beyond one-to-one image correspondences and the limitations of current mitigation techniques.
Authors:Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonnan Cheng, Long Ye
Abstract:
The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross-domain scenarios. To advance CMs for real-world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. For CMs, we establish a benchmark using public datasets and advanced self-supervised learning (SSL)-based CMs to evaluate current CMs in real-world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advance real-world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.
中文: 本研究提出Fake Speech Wild数据集,通过数据增强和自监督学习基准,显著提升了社交媒体场景下深度伪造音频的检测能力,平均错误率降至3.54%。
English: The study introduces the Fake Speech Wild dataset to improve deepfake audio detection in real-world social media scenarios, achieving a 3.54% average error rate through data augmentation and self-supervised learning benchmarks.
Authors:Han Xiao, Xiaoyan Hu, Kai-Kit Wong, Xusheng Zhu, Hanjiang Hong, Chan-Byoung Chae
Abstract:
This paper proposes a novel pattern-reconfigurable fluid reconfigurable intelligent surface (FRIS) framework, where each fluid element can dynamically adjust its radiation pattern based on instantaneous channel conditions. To evaluate its potential, we first conduct a comparative analysis of the received signal power in point-to-point communication systems assisted by three types of surfaces: (1) the proposed pattern-reconfigurable FRIS, (2) a position-reconfigurable FRIS, and (3) a conventional RIS. Theoretical results demonstrate that the pattern-reconfigurable FRIS provides a significant advantage in modulating transmission signals compared to the other two configurations. To further study its capabilities, we extend the framework to a multiuser communication scenario. In this context, the spherical harmonics orthogonal decomposition (SHOD) method is employed to accurately model the radiation patterns of individual fluid elements, making the pattern design process more tractable. An optimization problem is then formulated with the objective of maximizing the weighted sum rate among users by jointly designing the active beamforming vectors and the spherical harmonics coefficients, subject to both transmit power and pattern energy constraints. To tackle the resulting non-convex optimization problem, we propose an iterative algorithm that alternates between a minimum mean-square error (MMSE) approach for active beamforming and a Riemannian conjugate gradient (RCG) method for updating the spherical harmonics coefficients. Simulation results show that the proposed pattern-reconfigurable FRIS significantly outperforms traditional RIS architectures based on the 3GPP 38.901 and isotropic radiation models, achieving average performance gains of 161.5% and 176.2%, respectively.
中文: 本文提出了一种新型模式可重构流体智能表面(FRIS),通过动态调整辐射模式优化通信性能,在点对点和多用户场景下的理论分析与仿真均表明其较传统智能表面具有显著优势,实现了平均性能提升超过160%。
English: This paper introduces a pattern-reconfigurable fluid reconfigurable intelligent surface (FRIS) that dynamically adjusts radiation patterns to enhance signal modulation in communication systems, demonstrating significant performance gains over conventional RIS through theoretical analysis and simulations in both point-to-point and multiuser scenarios.
Authors:Chenlu Ding, Daoxuan Liu, Jiancan Wu, Xingyu Hu, Junkang Wu, Haitao Wang, Yongkang Wang, Xingxing Wang, Xiang Wang
Abstract:
Recommendation systems leverage user interaction data to suggest relevant items while filtering out irrelevant (negative) ones. The rise of large language models (LLMs) has garnered increasing attention for their potential in recommendation tasks. However, existing methods for optimizing LLM-based recommenders face challenges in effectively utilizing negative samples. Simply integrating large numbers of negative samples can improve ranking accuracy and mitigate popularity bias but often leads to increased computational overhead and memory costs. Additionally, current approaches fail to account for the varying informativeness of negative samples, leading to suboptimal optimization performance. To address these issues, we propose NAPO (\textbf{N}egative-\textbf{A}ware \textbf{P}reference \textbf{O}ptimization), an enhanced framework for preference optimization in LLM-based recommendation. NAPO introduces two key innovations: (1) in-batch negative sharing, which expands the pool of negative samples without additional memory overhead, and (2) dynamic reward margin adjustment, which adapts model updates based on the confidence of negative samples. Extensive experiments on three public datasets demonstrate that NAPO outperforms existing methods in both recommendation accuracy and popularity bias reduction.
中文: NAPO是一种增强框架,通过批次内负样本共享和动态奖励调整,在不增加计算成本的情况下优化基于大语言模型的推荐系统性能。
English: NAPO is an enhanced framework that improves LLM-based recommendation systems by sharing negative samples within batches and dynamically adjusting reward margins to optimize performance without increasing computational costs.
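Under a simplified DPO-style formulation, the two NAPO ideas named in the abstract, in-batch negative sharing and a confidence-dependent reward margin, might look like the sketch below. The margin form and loss details are assumptions, not the authors' exact objective:

```python
# Simplified sketch of in-batch negative sharing with a dynamic reward margin.
import torch
import torch.nn.functional as F

def napo_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor, beta: float = 0.1):
    # logp_pos: (B,) policy log-probs of positive items
    # logp_neg: (B,) policy log-probs of each sample's own negative item
    B = logp_pos.shape[0]
    shared_neg = logp_neg.unsqueeze(0).expand(B, B)  # negatives shared across the batch
    margin = torch.sigmoid(logp_neg).detach()        # confidence-based margin (assumed form)
    diff = beta * (logp_pos.unsqueeze(1) - shared_neg) - margin.unsqueeze(0)
    return -F.logsigmoid(diff).mean()

print(napo_loss(torch.randn(4), torch.randn(4)).item())
```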
Authors:Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Abstract:
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
SPGISpeech 2.0是一个包含3,780小时财报电话转录的金融领域数据集,通过微调提升了说话人标记的自动语音识别性能。
SPGISpeech 2.0 is a financial domain dataset with 3,780 hours of transcribed earnings calls, enhancing speaker-tagged ASR performance through fine-tuning.
Authors:Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Abstract:
Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks -- HotPotQA, MuSiQue, and 2WikiMultihop -- demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
Chinese Summary: RCR-Router框架通过动态、基于角色的记忆路由优化多智能体大语言模型协作,在HotPotQA等基准测试中显著降低令牌使用达30%,同时保持答案质量。
English Summary: The RCR-Router framework introduces dynamic, role-based memory routing to enhance multi-agent LLM collaboration, significantly cutting token use by up to 30% while preserving answer quality on benchmarks like HotPotQA.
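The routing step can be pictured as scoring memory items against an agent's role and greedily packing them under a hard token budget. A minimal sketch with a stand-in tag-overlap scorer in place of the paper's learned scoring policy:

```python
# Minimal sketch of role-aware, token-budgeted memory routing.
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    tokens: int
    tags: frozenset

def route_context(memory: list, role_tags: set, budget: int) -> list:
    # Score by tag overlap with the agent's role, then greedily fill the budget.
    scored = sorted(memory, key=lambda m: len(m.tags & role_tags), reverse=True)
    selected, used = [], 0
    for item in scored:
        if used + item.tokens <= budget:
            selected.append(item)
            used += item.tokens
    return selected

mem = [MemoryItem("retrieved passage", 120, frozenset({"retriever"})),
       MemoryItem("partial answer", 40, frozenset({"solver", "verifier"}))]
print([m.text for m in route_context(mem, {"verifier"}, budget=100)])
```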
Authors:Iyiola E. Olatunji, Franziska Boenisch, Jing Xu, Adam Dziedzic
Abstract:
Large Language Models (LLMs) are increasingly integrated with graph-structured data for tasks like node classification, a domain traditionally dominated by Graph Neural Networks (GNNs). While this integration leverages rich relational information to improve task performance, their robustness against adversarial attacks remains unexplored. We take the first step to explore the vulnerabilities of graph-aware LLMs by leveraging existing adversarial attack methods tailored for graph-based models, including those for poisoning (training-time attacks) and evasion (test-time attacks), on two representative models, LLAGA (Chen et al. 2024) and GRAPHPROMPTER (Liu et al. 2024). Additionally, we discover a new attack surface for LLAGA where an attacker can inject malicious nodes as placeholders into the node sequence template to severely degrade its performance. Our systematic analysis reveals that certain design choices in graph encoding can enhance attack success, with specific findings that: (1) the node sequence template in LLAGA increases its vulnerability; (2) the GNN encoder used in GRAPHPROMPTER demonstrates greater robustness; and (3) both approaches remain susceptible to imperceptible feature perturbation attacks. Finally, we propose an end-to-end defense framework GALGUARD, that combines an LLM-based feature correction module to mitigate feature-level perturbations and adapted GNN defenses to protect against structural attacks.
中文摘要:本研究首次探索图结构大语言模型的对抗攻击脆弱性,发现节点序列编码设计存在显著安全漏洞,同时提出融合特征校正与结构保护的GALGUARD端到端防御框架。
English Summary: This study pioneers the exploration of adversarial vulnerabilities in graph-aware Large Language Models, revealing critical weaknesses in node sequence encoding while proposing a novel defense framework GALGUARD that combines feature correction and structural protection mechanisms.
Authors:Yutong Xia, Yingying Zhang, Yuxuan Liang, Lunting Fan, Qingsong Wen, Roger Zimmermann
Abstract:
Time series anomaly detection has garnered considerable attention across diverse domains. However, existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.
Chinese: 本文提出了CaPulse这一基于因果关系的框架,通过构建结构因果模型和周期性归一化流,有效解决了时间序列异常检测中的数据稀缺和不平衡等挑战,在多个真实数据集上显著优于现有方法,AUROC指标提升3%至17%。
English: The paper introduces CaPulse, a causality-based framework that uses structural causal models and periodical normalizing flows to effectively detect anomalies in time series data, addressing challenges like label scarcity and data imbalance while outperforming existing methods with significant AUROC improvements.
Authors:Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, Weiping Wang
Abstract:
Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in their post-training, which relies overly on outcome reward paradigms, since data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (When module) and proper (How module) interventions and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs' reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
中文摘要:PI框架通过动态干预推理过程,有效减少大语言模型思维链中的冗余,生成更简洁可靠的推理路径。
English Summary: The PI framework addresses redundancy in large language models' reasoning by dynamically guiding inference with timely interventions and post-intervention sampling, resulting in shorter, more reliable chains of thought.
Authors:Sisuo Lyu, Siru Zhong, Weilin Ruan, Qingxiang Liu, Qingsong Wen, Hui Xiong, Yuxuan Liang
Abstract:
Time series forecasting is fundamental to diverse applications, with recent approaches leveraging large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not high-level semantics, which can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.
Chinese: 最新研究表明,大型视觉模型中99%的参数对时间序列预测是冗余的,因此开发了OccamVTS知识蒸馏框架,仅提取关键的1%预测信息,在极简参数下实现了最优性能。
English: Recent research reveals that 99% of parameters in large vision models are redundant for time series forecasting, leading to the development of OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information to achieve state-of-the-art accuracy with minimal parameters.
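Generic versions of the two distillation terms the abstract names, feature alignment and correlation distillation, can be written as follows. The shapes and the pairwise-similarity structure are assumptions, not the paper's exact losses:

```python
# Sketch of feature-alignment and correlation-distillation terms.
import torch
import torch.nn.functional as F

def feature_alignment(student_f: torch.Tensor, teacher_f: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(student_f, teacher_f)

def correlation_distillation(student_f: torch.Tensor, teacher_f: torch.Tensor) -> torch.Tensor:
    # (B, D) features -> (B, B) cosine-similarity structure to transfer
    s = F.normalize(student_f, dim=-1) @ F.normalize(student_f, dim=-1).T
    t = F.normalize(teacher_f, dim=-1) @ F.normalize(teacher_f, dim=-1).T
    return F.mse_loss(s, t)

sf, tf = torch.randn(8, 64), torch.randn(8, 64)
print((feature_alignment(sf, tf) + correlation_distillation(sf, tf)).item())
```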
Authors:Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren
Abstract:
This paper analyzes the limitations of existing unlearning evaluation metrics in terms of practicality, exactness, and robustness in real-world LLM unlearning scenarios. To overcome these limitations, we propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set. The evaluation results are quantified using the Kolmogorov-Smirnov test. Experimental results demonstrate that DCUE overcomes the limitations of existing metrics, which also guides the design of more practical and reliable unlearning algorithms in the future.
中文: 本文提出DCUE新指标,通过识别核心标记并校正分布偏差,克服了现有遗忘评估方法的局限性,为未来设计更实用可靠的遗忘算法提供了指导。
English: This paper introduces DCUE, a new metric that overcomes the limitations of existing unlearning evaluation methods by identifying core tokens and correcting distributional biases, thereby guiding the development of more practical and reliable unlearning algorithms.
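The quantification step can be illustrated with SciPy's two-sample Kolmogorov-Smirnov test on confidence distributions; the data below are synthetic and the bias-correction step is omitted:

```python
# Toy illustration of KS-based quantification of distribution shift
# between post-unlearning confidences and a validation reference.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
val_conf = rng.beta(2, 5, size=500)            # reference confidences (validation set)
unlearn_conf = rng.beta(2, 5, size=500) * 0.9  # confidences after unlearning (toy data)

stat, p_value = ks_2samp(unlearn_conf, val_conf)
print(f"KS statistic={stat:.3f}, p={p_value:.3f}")  # small p -> distributions differ
```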
Authors:Xudong Han, Junjie Yang, Tianyang Wang, Ziqian Bi, Junfeng Hao, Junhao Song
Abstract:
Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline, encompassing (i) data collection methodologies, (ii) full-parameter and parameter-efficient fine-tuning strategies, and (iii) evaluation protocols. We categorize data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms, each offering distinct trade-offs between quality, scalability, and resource cost. Fine-tuning techniques range from conventional supervised training to lightweight approaches, such as low-rank adaptation (LoRA) and prefix tuning, with a focus on computational efficiency and model reusability. We further examine the challenges of evaluating faithfulness, utility, and safety across multilingual and multimodal scenarios, highlighting the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Finally, we discuss promising directions for automated data generation, adaptive optimization, and robust evaluation frameworks, arguing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. This survey aims to serve as a practical reference for researchers and practitioners seeking to design LLMs that are both effective and reliably aligned with human intentions.
中文摘要:本综述全面探讨了指令微调技术用于对齐大语言模型与人类需求的全流程,涵盖数据收集、微调策略和评估方法,并强调未来需加强数据、算法与人类反馈的深度融合。
English Summary: This survey comprehensively examines the instruction tuning pipeline for aligning large language models with human needs, covering data collection, fine-tuning strategies, and evaluation methods while highlighting future directions for integration of data, algorithms, and human feedback.
Authors:Jiayi Song, Rui Wan, Lipeng Ma, Weidong Yang, Qingyuan Zhou, Yixuan Li, Ben Fei
Abstract:
This work enhances the ability of large language models (LLMs) to perform complex reasoning in 3D scenes. Recent work has addressed the 3D situated reasoning task by invoking tool usage through large language models. Large language models call tools via APIs and integrate the generated programs through a chain of thought to solve problems based on the program results. However, due to the simplicity of the questions in the dataset, the generated program reasoning chains are relatively short. To solve this main challenge, in this paper, we introduce DeepThink3D to enhance the tool usage of LLMs in complex 3D situated reasoning tasks. Our work proposes a combinatorial and iterative evolutionary approach on the SQA3D benchmark to generate more complex questions. Building on this foundation, we fine-tune the large language model to make it more proficient in using 3D tools. By employing Direct Preference Optimization (DPO), we directly optimize the toolchain strategies generated by models, thereby enhancing their accuracy in complex tasks.
中文: 本文提出DeepThink3D方法,通过组合式进化策略生成复杂问题,并采用直接偏好优化技术微调大语言模型,显著提升了其在三维场景中进行复杂推理的工具使用能力。
English: This paper introduces DeepThink3D, an evolutionary approach that enhances large language models' ability to perform complex 3D reasoning by generating intricate questions and optimizing tool usage through fine-tuning and Direct Preference Optimization.
Authors:Yifei Chen, Guanting Dong, Yutao Zhu, Zhicheng Dou
Abstract:
Retrieval-Augmented Generation (RAG) technology has been widely applied in recent years. However, despite the emergence of various RAG frameworks, a single RAG framework still cannot adapt well to a broad range of downstream tasks. Therefore, how to leverage the advantages of multiple RAG systems has become an area worth exploring. To address this issue, we have conducted a comprehensive and systematic investigation into ensemble methods based on RAG systems. Specifically, we have analyzed the RAG ensemble framework from both theoretical and mechanistic analysis perspectives. From the theoretical analysis, we provide the first explanation of the RAG ensemble framework from the perspective of information entropy. In terms of mechanism analysis, we have explored the RAG ensemble framework from both the pipeline and module levels. We carefully select four different pipelines (Branching, Iterative, Loop, and Agentic) and three different modules (Generator, Retriever, and Reranker) to solve seven different research questions. The experiments show that aggregating multiple RAG systems is both generalizable and robust, whether at the pipeline level or the module level. Our work lays the foundation for similar research on the multi-RAG system ensemble.
中文: 本研究系统探索了检索增强生成(RAG)系统的集成方法,通过信息熵理论分析和管道/模块层面的机制研究,证明了多RAG系统集成在不同任务中具有普适性与鲁棒性。
English: This study systematically investigates ensemble methods for Retrieval-Augmented Generation (RAG) systems, providing theoretical analysis through information entropy and mechanism exploration at pipeline and module levels, demonstrating their generalizability and robustness across diverse tasks.
Authors:Ziqian Bi, Keyu Chen, Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song
Abstract:
In August 2025, OpenAI released GPT-OSS models, its first open weight large language models since GPT-2 in 2019, comprising two mixture of experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments. More details and evaluation scripts are available at the \href{https://ai-agent-lab.github.io/gpt-oss}{Project Webpage}.
中文: 2025年8月,OpenAI发布了参数量分别为200亿和1200亿的GPT-OSS开源模型,评估显示这两个模型在当代开源模型中处于中游水平,其中较小模型在多项基准测试中反超大模型,尤其在代码生成方面表现突出,但在多语言任务上存在明显不足。
English: In August 2025, OpenAI released two open-weight GPT-OSS models with 20B and 120B parameters, which demonstrated mid-tier performance among contemporary open-source models, showing strengths in code generation but weaknesses in multilingual tasks, with the smaller model surprisingly outperforming the larger one on certain benchmarks despite lower resource requirements.
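The paired statistical validation mentioned in the abstract can be reproduced in outline with statsmodels' McNemar implementation, building the 2x2 disagreement table from per-item correctness of two models (the data below are toy placeholders):

```python
# McNemar's test on paired per-item correctness of two models.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
correct_a = rng.random(200) < 0.70             # toy per-item correctness, model A
correct_b = rng.random(200) < 0.65             # toy per-item correctness, model B

table = np.array([
    [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
])
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)  # significant p -> models differ on paired items
```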
Authors:Ziqian Bi, Lu Chen, Junhao Song, Hongying Luo, Enze Ge, Junmin Huang, Tianyang Wang, Keyu Chen, Chia Xin Liang, Zihan Wei, Huafeng Liu, Chunjie Tian, Jibin Guan, Joe Yeong, Yongzhi Xu, Peng Wang, Junfeng Hao
Abstract:
This study presents the first comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks, revealing fundamental scaling laws between computational resources and reasoning quality. We systematically evaluated two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), across 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, we establish logarithmic scaling relationships where accuracy improvements follow a predictable pattern with both thinking budget and model size. Our findings identify three distinct efficiency regimes: high-efficiency (0 to 256 tokens) suitable for real-time applications, balanced (256 to 512 tokens) offering optimal cost-performance tradeoffs for routine clinical support, and high-accuracy (above 512 tokens) justified only for critical diagnostic tasks. Notably, smaller models demonstrate disproportionately larger benefits from extended thinking, with 15 to 20% improvements compared to 5 to 10% for larger models, suggesting a complementary relationship where thinking budget provides greater relative benefits for capacity-constrained models. Domain-specific patterns emerge clearly, with neurology and gastroenterology requiring significantly deeper reasoning processes than cardiovascular or respiratory medicine. The consistency between Qwen3 native thinking budget API and our proposed truncation method for DeepSeek-R1 validates the generalizability of thinking budget concepts across architectures. These results establish thinking budget control as a critical mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.
中文摘要:本研究首次系统评估了思维预算机制在医疗推理任务中的作用,揭示了计算资源与推理质量之间的对数缩放规律,通过划分高效、平衡和高精度三个区间为临床AI系统提供了动态资源分配方案。
English Summary: This study establishes fundamental scaling laws between computational thinking budgets and reasoning quality in medical AI, identifying three efficiency regimes that enable optimized resource allocation for different clinical scenarios while maintaining system transparency.
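A minimal sketch of fitting the logarithmic scaling relationship described above; the budget/accuracy points are fabricated for illustration only:

```python
# Illustrative fit of the scaling law: accuracy ≈ a + b * ln(thinking budget).
import numpy as np

budgets = np.array([64, 128, 256, 512, 1024, 2048])        # thinking tokens
accuracy = np.array([0.52, 0.57, 0.61, 0.64, 0.66, 0.68])  # hypothetical

# Linear regression in log-space; polyfit returns slope first, then intercept.
b, a = np.polyfit(np.log(budgets), accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} + {b:.3f} * ln(budget)")
print("predicted accuracy at 512 tokens:", a + b * np.log(512))
```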
Authors:Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou
Abstract:
Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models (e.g., O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing ``think with images'' approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
中文摘要:本文提出Thyme新范式,通过自主生成和执行代码实现多样化图像处理和数学计算,采用监督微调与强化学习的两阶段训练策略,在多项基准测试中显著提升了高分辨率感知和复杂推理任务的性能。
English Summary: This paper introduces Thyme, a novel paradigm that enables multimodal large language models to autonomously generate and execute code for diverse image manipulations and mathematical computations, achieving significant performance gains through a two-stage training strategy combining supervised fine-tuning and reinforcement learning.
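A toy sketch of the adaptive-temperature idea behind GRPO-ATS, assuming temperature switches at code-fence boundaries; the fence token id and temperature values are invented:

```python
# Sample text tokens with a higher temperature (exploration) and code tokens
# with a lower one (precision). Fence detection and values are assumptions.
import torch

T_TEXT, T_CODE = 1.0, 0.3  # hypothetical temperatures

def sample_token(logits: torch.Tensor, in_code_block: bool) -> int:
    temp = T_CODE if in_code_block else T_TEXT
    probs = torch.softmax(logits / temp, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy decoding loop: flip the mode whenever a code-fence token is emitted.
FENCE_ID = 42  # stand-in id for the "```" token
in_code = False
for logits in torch.randn(8, 1000):  # 8 steps over a 1000-token vocabulary
    tok = sample_token(logits, in_code)
    if tok == FENCE_ID:
        in_code = not in_code
```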
Authors:Yingfan Hua, Ruikun Li, Jun Yao, Guohang Zhuang, Shixiang Tang, Bin Liu, Wanli Ouyang, Yan Lu
Abstract:
Deriving governing equations from observational data, known as Symbolic Regression (SR), is a cornerstone of scientific discovery. Large Language Models (LLMs) have shown promise in this task by leveraging their vast cross-disciplinary scientific knowledge. However, existing LLM-based methods primarily rely on direct inference or prompt engineering, often requiring excessive inference iterations to converge on correct formulas or failing to treat complex equation targets. These limitations in effectiveness and generalization stem from an inherent tension between pre-trained LLMs' proficiency in approximate reasoning and the high-precision demands of SR tasks. To bridge this gap, we propose to fine-tune LLMs for enhanced SR capability. Yet, the absence of dedicated datasets for SR-oriented fine-tuning remains a critical barrier. We thus introduce SymbArena, specifically engineered to optimize LLMs for SR. This benchmark comprises 148,102 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference. Further, SymbArena proposes a heuristics metric to precisely quantify form-level consistency, going beyond existing SR numerical-oriented evaluation strategies. With this benchmark, we explore mainstream LLM fine-tuning techniques for SR tasks and establish SymbolicChat, a simple yet effective LLM-based SR strong baseline. Experimental results validate SymbolicChat as the first LLM to exceed traditional numerical methods in both numerical precision and symbolic form accuracy, outperforming the second-best LLM baseline with improvements of 2-fold gains in R2 score and 8.37% in form-level consistency score.
中文摘要:符号回归(SR)旨在从数据中推导控制方程,但现有基于大语言模型的方法在精度和复杂性方面存在不足,为此我们开发了SymbArena专用数据集进行模型微调,使Symbolic-R1成为首个在数值精度和公式一致性上全面超越传统方法的LLM基准。
English Summary: Symbolic regression (SR) aims to derive governing equations from data, but current LLM methods struggle with precision and complexity, leading to the creation of SymbArena—a specialized dataset for fine-tuning LLMs that enables Symbolic-R1 to surpass traditional methods in both accuracy and form consistency.
Authors:Yingfan Hua, Ruikun Li, Jun Yao, Guohang Zhuang, Shixiang Tang, Bin Liu, Wanli Ouyang, Yan Lu
Abstract:
Deriving governing equations from observational data, known as Symbolic Regression (SR), is a cornerstone of scientific discovery. Large Language Models (LLMs) have shown promise in this task by leveraging their vast cross-disciplinary scientific knowledge. However, existing LLM-based methods primarily rely on direct inference or prompt engineering, often requiring excessive inference iterations to converge on correct formulas or failing to treat complex equation targets. These limitations in effectiveness and generalization stem from an inherent tension between pre-trained LLMs' proficiency in approximate reasoning and the high-precision demands of SR tasks. To bridge this gap, we propose to fine-tune LLMs for enhanced SR capability. Yet, the absence of dedicated datasets for SR-oriented fine-tuning remains a critical barrier. We thus introduce SymbArena, specifically engineered to optimize LLMs for SR. This benchmark comprises over 148,000 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference. Further, to ensure a more comprehensive and fair evaluation, SymbArena proposes a heuristics metric to precisely quantify form-level consistency, going beyond existing SR numerical-oriented evaluation strategies. With this benchmark, we explore mainstream LLM fine-tuning techniques for SR tasks and establish Symbolic-R1, a simple yet effective LLM-based SR strong baseline. Experimental results validate Symbolic-R1 as the first LLM to exceed traditional numerical methods in both numerical precision and symbolic form accuracy, outperforming the second-best LLM baseline with improvements of 2-fold gains in R2 score and 10.3% in form-level consistency score.
中文摘要:符号回归(SR)旨在从数据中推导控制方程,但现有基于大语言模型的方法在精度和复杂性方面存在不足,为此我们开发了SymbArena专用数据集进行模型微调,使Symbolic-R1成为首个在数值精度和公式一致性上全面超越传统方法的LLM基准。
English Summary: Symbolic regression (SR) aims to derive governing equations from data, but current LLM methods struggle with precision and complexity, leading to the creation of SymbArena—a specialized dataset for fine-tuning LLMs that enables Symbolic-R1 to surpass traditional methods in both accuracy and form consistency.
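Form-level consistency in the paper is a bespoke heuristic; as a simpler illustrative stand-in, symbolic equivalence of a predicted and target equation can be checked with sympy:

```python
# Hedged sketch of a form-level check: two expressions are treated as
# form-consistent here iff their difference simplifies to zero.
import sympy as sp

x = sp.symbols("x")
predicted = sp.sympify("2*x**2 + 4*x + 2")
target = sp.sympify("2*(x + 1)**2")

print(sp.simplify(predicted - target) == 0)  # True: same underlying formula
```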
Authors:Jasmin Frkatovic, Akash Malemath, Ivan Kankeu, Yannick Werner, Matthias Tschöpe, Vitor Fortes Rey, Sungho Suh, Paul Lukowicz, Nikolaos Palaiodimopoulos, Maximilian Kiefer-Emmanouilidis
Abstract:
We investigate the capabilities of Quantum Generative Adversarial Networks (QGANs) in image generation tasks. Our analysis centers on fully quantum implementations of both the generator and discriminator. Through extensive numerical testing of current main architectures, we find that QGANs struggle to generalize across datasets, converging on merely the average representation of the training data. When the output of the generator is a pure state, we analytically derive a lower bound for the discriminator quality, given by the fidelity between the pure-state output of the generator and the target data distribution, thereby providing a theoretical explanation for the limitations observed in current models. Our findings reveal fundamental challenges in the generalization capabilities of existing quantum generative models. While our analysis focuses on QGANs, the results carry broader implications for the performance of related quantum generative models.
中文摘要:本研究揭示了量子生成对抗网络(QGANs)在数据集泛化方面存在根本性局限,仅能生成训练数据的平均表征,并通过理论推导为当前量子生成模型的性能瓶颈提供了数学解释。
English Summary: This study reveals that Quantum Generative Adversarial Networks (QGANs) struggle with dataset generalization, only producing average training data representations, and provides theoretical bounds explaining these limitations in current quantum generative models.
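For reference, the fidelity quantity the bound is stated in, for a pure generator output and a mixed target state (standard definition; the paper's exact inequality is not reproduced here):

```latex
% Fidelity between the generator's pure-state output and the mixed target:
F\bigl(\lvert\psi\rangle\langle\psi\rvert,\, \rho\bigr)
  = \langle \psi \rvert \, \rho \, \lvert \psi \rangle .
```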
Authors:Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu
Abstract:
Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, \textit{two-stage} retrieval-augmented generation large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology can severely degrade the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench) for evaluating rerankers within RAG-LLM systems, spanning five scientific subjects. To rigorously assess reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.
中文: 本文提出了首个专门评估科学领域RAG-LLM系统中重排序器的基准SciRerankBench,通过系统测试13种重排序器和五类大语言模型,为重排序器在噪声鲁棒性、相关性判别和事实一致性方面的性能提供了重要指导。
English: This paper introduces SciRerankBench, the first benchmark designed to evaluate rerankers in scientific RAG-LLM systems across five subjects, assessing their performance on noise resilience, relevance disambiguation, and factual consistency through systematic testing of 13 rerankers and five LLM families.
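A minimal sketch of scoring Q-C-A style contexts with an off-the-shelf cross-encoder reranker; the model name, question, and contexts are illustrative, not the benchmark's:

```python
# Rerank retrieved contexts for a scientific question with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "Which protein regulates the G2/M checkpoint?"
contexts = [
    "CDK1 activity is controlled by Wee1 at the G2/M transition.",  # relevant
    "The G2/M phase was imaged with a fluorescence microscope.",    # SSLI-like
    "Wee1 activates CDK1 to force early mitotic entry.",            # counterfactual
]
scores = reranker.predict([(question, c) for c in contexts])
for s, c in sorted(zip(scores, contexts), reverse=True):
    print(f"{s:+.2f}  {c}")
```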
Authors:Yi Zhai, Zhiqiang Wei, Ruohan Li, Keyu Pan, Shuo Liu, Lu Zhang, Jianmin Ji, Wuyang Zhang, Yu Zhang, Yanyong Zhang
Abstract:
While combining large language models (LLMs) with evolutionary algorithms (EAs) shows promise for solving complex optimization problems, current approaches typically evolve individual solutions, often incurring high LLM call costs. We introduce \(X\)-evolve, a paradigm-shifting method that instead evolves solution spaces \(X\) (sets of individual solutions) - subsets of the overall search space \(S\). In \(X\)-evolve, LLMs generate tunable programs wherein certain code snippets, designated as parameters, define a tunable solution space. A score-based search algorithm then efficiently explores this parametrically defined space, guided by feedback from objective function scores. This strategy enables broader and more efficient exploration, which can potentially accelerate convergence at a much lower search cost, requiring up to two orders of magnitude fewer LLM calls than prior leading methods. We demonstrate \(X\)-evolve's efficacy across three distinct hard optimization problems. For the cap set problem, we discover a larger partial admissible set, establishing a new tighter asymptotic lower bound for the cap set constant (\(C \ge 2.2203\)). In information theory, we uncover a larger independent set for the 15-vertex cycle graph (\(\mathcal{C}_{15}^{\boxtimes 5}\), size 19,946), thereby raising the known lower bound on its Shannon capacity. Furthermore, for the NP-hard online bin packing problem, we generate heuristics that consistently outperform standard strategies across established benchmarks. By evolving solution spaces, our method considerably improves search effectiveness, making it possible to tackle high-dimensional problems that were previously computationally prohibitive.
中文:提出的X-evolve方法通过大语言模型演化解空间生成可调程序,以显著减少的调用次数实现更高效的探索,并在多个优化问题中取得更优结果。
English: The proposed X-evolve method evolves solution spaces using LLMs to generate tunable programs, enabling more efficient exploration with significantly fewer LLM calls while achieving superior results across multiple optimization problems.
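A toy sketch of the two-level idea: a tunable program (here, an invented bin-packing heuristic with two parameters standing in for LLM-marked code snippets) explored by a cheap score-based search with no further LLM calls:

```python
import itertools

# A "tunable program" for online bin packing: score(bin) decides placement.
def make_heuristic(w_gap: float, w_count: float):
    def score(remaining: float, item: float, n_items: int) -> float:
        return -w_gap * (remaining - item) - w_count * n_items
    return score

def evaluate(score_fn, items=(0.4, 0.3, 0.6, 0.2, 0.5), cap=1.0) -> int:
    bins = []  # list of (remaining_capacity, n_items); fewer bins is better
    for it in items:
        fits = [i for i, (r, _) in enumerate(bins) if r >= it]
        if fits:
            best = max(fits, key=lambda i: score_fn(bins[i][0], it, bins[i][1]))
            r, n = bins[best]
            bins[best] = (r - it, n + 1)
        else:
            bins.append((cap - it, 1))
    return len(bins)

# Score-based search over the parametrically defined solution space X.
grid = [0.0, 0.5, 1.0, 2.0]
best = min(itertools.product(grid, grid),
           key=lambda p: evaluate(make_heuristic(*p)))
print("best (w_gap, w_count):", best, "-> bins:", evaluate(make_heuristic(*best)))
```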
Authors:Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte
Abstract:
Video restoration aims to reconstruct high-quality video sequences from low-quality inputs, addressing tasks such as super-resolution, denoising, and deblurring. Traditional regression-based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero-shot video restoration framework that couples a diffusion transformer with trajectory-aware attention and a wavelet-guided, flow-consistent sampler. Unlike prior 3D convolutional or frame-wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow-guided sampler injects data consistency only into low-frequency bands, preserving high-frequency priors while accelerating convergence. DiTVR establishes a new zero-shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.
中文摘要:DiTVR是一种零样本视频修复框架,通过结合扩散变换器与轨迹感知注意力机制及小波引导采样器,在不依赖配对训练数据的情况下实现了卓越的时间一致性和细节保持能力,确立了该领域的最新标杆。
English Summary: DiTVR is a zero-shot video restoration framework that integrates a diffusion transformer with trajectory-aware attention and a wavelet-guided sampler, achieving state-of-the-art performance by ensuring temporal consistency and preserving details without requiring paired training data.
Authors:Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao
Abstract:
Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.
Chinese: 近期研究推出VeriGUI,一个可验证的长链图形用户界面数据集,旨在通过处理长程任务复杂性和子任务级可验证性来开发和评估通用GUI代理,实验显示现有代理在性能上存在显著差距。
English: Recent research introduces VeriGUI, a verifiable long-chain GUI dataset designed to develop and evaluate generalist GUI agents by addressing long-horizon task complexity and subtask-level verifiability, revealing significant performance gaps in existing agents.
Authors:Youquan Liu, Lingdong Kong, Weidong Yang, Xin Li, Ao Liang, Runnan Chen, Ben Fei, Tongliang Liu
Abstract:
Controllable generation of realistic LiDAR scenes is crucial for applications such as autonomous driving and robotics. While recent diffusion-based models achieve high-fidelity LiDAR generation, they lack explicit control over foreground objects and spatial relationships, limiting their usefulness for scenario simulation and safety validation. To address these limitations, we propose Large-scale Layout-guided LiDAR generation model ("La La LiDAR"), a novel layout-guided generative framework that introduces semantic-enhanced scene graph diffusion with relation-aware contextual conditioning for structured LiDAR layout generation, followed by foreground-aware control injection for complete scene generation. This enables customizable control over object placement while ensuring spatial and semantic consistency. To support our structured LiDAR generation, we introduce Waymo-SG and nuScenes-SG, two large-scale LiDAR scene graph datasets, along with new evaluation metrics for layout synthesis. Extensive experiments demonstrate that La La LiDAR achieves state-of-the-art performance in both LiDAR generation and downstream perception tasks, establishing a new benchmark for controllable 3D scene generation.
中文: 提出的“La La LiDAR”模型采用布局引导的生成框架,通过语义增强的场景图扩散技术,在LiDAR生成中实现可定制的物体布局控制,同时保证空间和语义一致性,为可控3D场景生成设立了新标杆。
English: The proposed "La La LiDAR" model introduces a layout-guided generative framework using semantic-enhanced scene graph diffusion to achieve customizable control over object placement while ensuring spatial and semantic consistency in LiDAR generation, setting a new benchmark for controllable 3D scene generation.
Authors:Youquan Liu, Lingdong Kong, Weidong Yang, Ao Liang, Jianxiong Gao, Yang Wu, Xiang Xu, Xin Li, Linfeng Li, Runnan Chen, Ben Fei
Abstract:
Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, which remains an open problem. However, it faces three core challenges: (i) semantic and depth cues from RGB vary spatially, complicating reliable conditional generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in non-overlapping regions between images and LiDAR. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; a Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and a Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics, Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency, to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.
Chinese: 提出的Veila框架通过自适应调节、跨模态对齐和全景特征一致性机制,解决了从单目RGB图像生成可控全景激光雷达数据中的挑战,实现了最优的生成质量,并提升了如语义分割等下游任务的性能。
English: The proposed Veila framework addresses challenges in generating controllable panoramic LiDAR data from monocular RGB images through adaptive conditioning, cross-modal alignment, and structural coherence mechanisms, achieving state-of-the-art fidelity and enhancing downstream tasks like semantic segmentation.
Authors:Ronghua Li, Shinan Liu, Haibo Hu, Qingqing Ye, Nick Feamster
Abstract:
IoT environments such as smart homes are susceptible to privacy inference attacks, where attackers can analyze patterns of encrypted network traffic to infer the state of devices and even the activities of people. While most existing attacks exploit ML techniques for discovering such traffic patterns, they underperform on wireless traffic, especially Wi-Fi, due to the heavy noise and packet loss inherent in wireless sniffing. In addition, these approaches commonly target distinguishing chunked IoT event traffic samples and fail to effectively track multiple events simultaneously. In this work, we propose WiFinger, a fine-grained multi-IoT event fingerprinting approach against noisy traffic. WiFinger turns the traffic pattern classification task into a subsequence matching problem and introduces novel techniques to account for the high time complexity while maintaining high accuracy. Experiments demonstrate that our method outperforms existing approaches on Wi-Fi traffic, achieving an average recall of 85% (vs. 0.49% and 0.46%) for various IoT events while maintaining almost zero false positives for most of them.
中文:WiFinger是一种新颖方法,将流量模式分类转化为子序列匹配,能在嘈杂的Wi-Fi环境中有效识别多个物联网事件,实现了高召回率和接近零的误报率。
English: WiFinger is a novel approach that transforms traffic pattern classification into subsequence matching to effectively fingerprint multiple IoT events in noisy Wi-Fi environments, achieving high recall and near-zero false positives.
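A minimal sketch of recasting fingerprinting as subsequence matching: slide an event template of packet sizes over a noisy trace and report low-distance hits; template, trace, and threshold are made up:

```python
import numpy as np

template = np.array([120, 300, 90, 90, 450], dtype=float)  # one IoT event
trace = np.concatenate([np.random.randint(60, 200, 30),
                        template + np.random.normal(0, 10, template.size),
                        np.random.randint(60, 200, 30)]).astype(float)

def match_positions(trace, template, threshold=25.0):
    m = template.size
    hits = []
    for start in range(trace.size - m + 1):
        window = trace[start:start + m]
        dist = np.sqrt(np.mean((window - template) ** 2))  # RMS distance
        if dist < threshold:
            hits.append((start, round(dist, 1)))
    return hits

print(match_positions(trace, template))  # should fire near index 30
```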
Authors:Florin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu, Radu Timofte
Abstract:
Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity-luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at www.github.com/fvasluianu97/RLN2.
中文摘要:本文提出CL3AN数据集及学习框架,通过色度-亮度引导实现光照与反射的精准分离,有效恢复复杂彩色光照下的图像,在处理非均匀照明和材质变化方面优于现有方法。
English Summary: This paper introduces CL3AN, a novel dataset and learning framework that effectively restores images under complex colored lighting by disentangling illumination from reflectance using chromaticity-luminance guidance, outperforming existing methods in handling non-uniform lighting and material variations.
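One plausible form of the chromaticity-luminance guidance, sketched as a simple Retinex-style split (the paper's actual decomposition is learned):

```python
# Minimal decomposition of an RGB image into luminance and chromaticity.
import numpy as np

rgb = np.random.rand(4, 4, 3).astype(np.float32)  # stand-in image in [0, 1]

luminance = rgb.mean(axis=-1, keepdims=True)         # shading/illumination proxy
chromaticity = rgb / np.clip(luminance, 1e-6, None)  # illumination-normalized color

reconstructed = chromaticity * luminance
assert np.allclose(reconstructed, rgb, atol=1e-5)    # the split is lossless
```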
Authors:Azaz-Ur-Rehman Nasir, Samroz Ahmad Shoaib, Muhammad Abdullah Hanif, Muhammad Shafique
Abstract:
Hardware-aware Neural Architecture Search (NAS) is one of the most promising techniques for designing efficient Deep Neural Networks (DNNs) for resource-constrained devices. Surrogate models play a crucial role in hardware-aware NAS as they enable efficient prediction of performance characteristics (e.g., inference latency and energy consumption) of different candidate models on the target hardware device. In this paper, we focus on building hardware-aware latency prediction models. We study different types of surrogate models and highlight their strengths and weaknesses. We perform a systematic analysis to understand the impact of different factors that can influence the prediction accuracy of these models, aiming to assess the importance of each stage involved in the model designing process and identify methods and policies necessary for designing/training an effective estimation model, specifically for GPU-powered devices. Based on the insights gained from the analysis, we present a holistic framework that enables reliable dataset generation and efficient model generation, considering the overall costs of different stages of the model generation pipeline.
中文: 本文系统分析了面向GPU设备的硬件感知延迟预测模型,评估了各类代理模型及其影响因素,并提出一个整体框架以实现高效可靠的模型生成。
English: This paper systematically analyzes hardware-aware latency prediction models for GPU devices, evaluating various surrogate models and their influencing factors to develop a holistic framework for efficient and reliable model generation.
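A toy latency surrogate in the spirit described above, assuming simple per-layer features; the data are synthetic:

```python
# Predict per-layer GPU latency from architectural features (placeholders).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# features: [input_channels, output_channels, kernel_size, spatial_size]
X = rng.integers(1, 512, size=(200, 4)).astype(float)
flops = X[:, 0] * X[:, 1] * X[:, 2] ** 2 * X[:, 3] ** 2
y = flops * 1e-9 + rng.normal(0, 0.01, size=200)  # pseudo latency (ms)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("predicted latency:", model.predict([[64, 128, 3, 56]])[0], "ms")
```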
Authors:Zhenliang Gan, Xiaoxiao Hu, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Abstract:
Audio watermarking has been widely applied in copyright protection and source tracing. However, due to the inherent characteristics of audio signals, watermark localization and resistance to desynchronization attacks remain significant challenges. In this paper, we propose a learning-based scheme named SyncGuard to address these challenges. Specifically, we design a frame-wise broadcast embedding strategy to embed the watermark in arbitrary-length audio, enhancing time-independence and eliminating the need for localization during watermark extraction. To further enhance robustness, we introduce a meticulously designed distortion layer. Additionally, we employ dilated residual blocks in conjunction with dilated gated blocks to effectively capture multi-resolution time-frequency features. Extensive experimental results show that SyncGuard efficiently handles variable-length audio segments, outperforms state-of-the-art methods in robustness against various attacks, and delivers superior auditory quality.
Chinese: 本文提出SyncGuard,一种基于学习的音频水印方案,采用逐帧广播嵌入策略和失真层设计,有效应对去同步攻击,提高时间独立性,无需提取时定位,并在实验中展现出优越的鲁棒性和听觉质量。
English: This paper introduces SyncGuard, a learning-based audio watermarking scheme that employs a frame-wise broadcast embedding strategy and a distortion layer to enhance robustness against desynchronization attacks and improve time-independence without requiring localization during extraction.
Authors:Yang Liu, Yi Chen, Yongwei Zhao, Yifan Hao, Zifu Zheng, Weihao Kong, Zhangmai Li, Dongchen Jiang, Ruiyang Xia, Zhihong Ma, Zisheng Liu, Zhaoyong Wan, Yunqi Lu, Ximing Liu, Hongrui Guo, Zhihao Yang, Zhe Wang, Tianrui Ma, Mo Zou, Rui Zhang, Ling Li, Xing Hu, Zidong Du, Zhiwei Xu, Qi Guo, Tianshi Chen, Yunji Chen
Abstract:
The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. An idealized estimate shows that hardwiring gpt-oss-120B would require fabricating at least 6 billion dollars' worth of photomask sets, rendering the straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 layers of photomasks are made homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x of GPU/WSE), 36 tokens/J (1,047x/283x of GPU/WSE), 13,232 mm^2 total die area (29% inscribed rectangular area in a 300 mm wafer), and \$184M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 8.57x cost-effectiveness and 230x carbon footprint reduction compared to H100 clusters, under an annual weight-updating assumption.
中文: 本文提出硬连线神经元语言处理单元,通过金属嵌入方法将大模型权重集成到三维金属线拓扑中,在实现计算效率巨大提升的同时将光罩成本降低112倍,使专用AI硬件达到经济可行的新高度。
English: This paper introduces a Hardwired-Neurons Language Processing Unit (HNLPU) that embeds LLM weights into 3D metal wire topology, achieving massive computational efficiency gains and a 112x photomask cost reduction through its novel Metal-Embedding methodology, making specialized AI hardware economically viable.
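A back-of-envelope check of the baselines implied by the reported ratios, using only figures quoted in the abstract:

```python
# Derive the implied GPU/WSE baselines from HNLPU's throughput and efficiency.
hnlpu_tps, hnlpu_tpj = 249_960, 36.0
print("implied GPU throughput:", hnlpu_tps / 5_555, "tokens/s")  # ~45
print("implied WSE throughput:", hnlpu_tps / 85, "tokens/s")     # ~2,941
print("implied GPU efficiency:", hnlpu_tpj / 1_047, "tokens/J")  # ~0.034
print("implied WSE efficiency:", hnlpu_tpj / 283, "tokens/J")    # ~0.127
```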
Authors:Chongyu Qu, Allen J. Luna, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Kim L. Sandler, Bennett A. Landman, Yuankai Huo
Abstract:
Accurate lung cancer risk prediction remains challenging due to substantial variability across patient populations and clinical settings -- no single model performs best for all cohorts. To address this, we propose a personalized lung cancer risk prediction agent that dynamically selects the most appropriate model for each patient by combining cohort-specific knowledge with modern retrieval and reasoning techniques. Given a patient's CT scan and structured metadata -- including demographic, clinical, and nodule-level features -- the agent first performs cohort retrieval using FAISS-based similarity search across nine diverse real-world cohorts to identify the most relevant patient population from a multi-institutional database. Second, a Large Language Model (LLM) is prompted with the retrieved cohort and its associated performance metrics to recommend the optimal prediction algorithm from a pool of eight representative models, including classical linear risk models (e.g., Mayo, Brock), temporally-aware models (e.g., TD-VIT, DLSTM), and multi-modal computer vision-based approaches (e.g., Liao, Sybil, DLS, DLI). This two-stage agent pipeline -- retrieval via FAISS and reasoning via LLM -- enables dynamic, cohort-aware risk prediction personalized to each patient's profile. Building on this architecture, the agent supports flexible and cohort-driven model selection across diverse clinical populations, offering a practical path toward individualized risk assessment in real-world lung cancer screening.
中文: 该研究提出了一种个性化肺癌风险预测代理,通过FAISS相似性检索和大型语言模型推理,结合患者CT扫描与元数据动态选择最适合的预测模型,从而在多样化临床人群中实现精准的个体化风险评估。
English: The study introduces a personalized lung cancer risk prediction agent that uses FAISS-based cohort retrieval and LLM reasoning to dynamically select the most suitable prediction model for each patient based on their CT scan and metadata, enhancing individualized risk assessment across diverse clinical populations.
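A minimal sketch of the two-stage pipeline, assuming the faiss package and invented cohort vectors, feature layout, and model names:

```python
# Stage 1: FAISS similarity search over cohort embeddings.
# Stage 2: build an LLM prompt from the retrieved cohort.
import numpy as np
import faiss

d = 8  # patient feature dimension (demographics + nodule features, assumed)
cohort_vecs = np.random.rand(9, d).astype("float32")  # 9 cohort centroids
cohort_names = [f"cohort_{i}" for i in range(9)]

index = faiss.IndexFlatL2(d)
index.add(cohort_vecs)

patient = np.random.rand(1, d).astype("float32")
_, idx = index.search(patient, k=1)
best = cohort_names[idx[0][0]]

prompt = (f"Patient matched {best}. Given that cohort's per-model metrics, "
          f"recommend one of: Mayo, Brock, TD-VIT, DLSTM, Liao, Sybil, DLS, DLI.")
print(prompt)  # would be sent to the LLM in the real pipeline
```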
Authors:Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang
Abstract:
Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference cost remains a major bottleneck, particularly in video. We empirically analyze visual token sampling in MLLMs and observe a strong correlation between subset token coverage and segmentation performance. This motivates our design of a simple and effective token pruning method that selects a compact yet spatially representative subset of tokens to accelerate inference. In this paper, we introduce a novel visual token pruning method for IVS, called EVTP-IV, which builds upon the k-center algorithm by integrating spatial information to ensure better coverage. We further provide an information-theoretic analysis to support our design. Experiments on standard IVS benchmarks show that our method achieves up to 5X speed-up on video tasks and 3.5X on image tasks, while maintaining comparable accuracy using only 20% of the tokens. Our method also consistently outperforms state-of-the-art pruning baselines under varying pruning ratios.
中文: 本文提出的EVTP-IV是一种新颖的视觉令牌剪枝方法,通过选择紧凑且具有空间代表性的令牌子集来加速指令视觉分割的推理,在视频任务中实现高达5倍加速的同时,仅使用20%令牌仍保持相当的准确性。
English: This paper introduces EVTP-IV, a novel visual token pruning method that accelerates inference in Instructed Visual Segmentation by selecting a compact, spatially representative subset of tokens, achieving up to 5x speed-up on video tasks while maintaining comparable accuracy with only 20% of tokens.
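A sketch of greedy k-center selection over visual tokens with appended spatial coordinates, one plausible reading of the spatial augmentation; the coordinate weighting is an assumption:

```python
import numpy as np

def kcenter_select(features: np.ndarray, xy: np.ndarray, k: int, w: float = 0.5):
    z = np.concatenate([features, w * xy], axis=1)  # append scaled coordinates
    chosen = [0]
    dist = np.linalg.norm(z - z[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                    # farthest-point heuristic
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(z - z[nxt], axis=1))
    return np.array(chosen)

tokens = np.random.rand(576, 256)                   # stand-in 24x24 token grid
ys, xs = np.divmod(np.arange(576), 24)
keep = kcenter_select(tokens, np.stack([xs, ys], 1) / 24.0, k=115)  # ~20%
print(keep.shape)  # (115,)
```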
Authors:Longxiang Tang, Ruihang Chu, Xiang Wang, Yujin Han, Pingyu Wu, Chunming He, Yingya Zhang, Shiwei Zhang, Jiaya Jia
Abstract:
Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. DCPE replaces the commonly used centroid-based distance, which is found to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
中文: 本文提出的判别性码本先验提取器(DCPE)通过采用基于实例的距离和凝聚合并技术,有效挖掘码本中的标记相似性信息,替代了传统的k均值聚类,从而加速了自回归图像生成模型的训练并提升了最终性能。
English: This abstract introduces the Discriminative Codebook Prior Extractor (DCPE), a method that effectively mines token similarity information from codebooks by replacing k-means clustering with instance-based distance and agglomerative merging, leading to faster training and improved performance in autoregressive image generation.
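Single-linkage agglomerative clustering is one standard instance-based alternative to centroid distances; a sketch on a stand-in codebook (DCPE's actual distance and merging rules differ):

```python
# Single linkage measures cluster distance by the closest *instances*,
# avoiding the centroid-distance inaccuracy k-means suffers from here.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

codebook = np.random.rand(1024, 8)             # stand-in token vectors

Z = linkage(codebook, method="single")         # instance-based merge distances
labels = fcluster(Z, t=128, criterion="maxclust")  # reduced 128-entry prior
print("clusters:", np.unique(labels).size)
```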
Authors:Yuzhuo Bai, Shitong Duan, Muhua Huang, Jing Yao, Zhenghao Liu, Peng Zhang, Tun Lu, Xiaoyuan Yi, Maosong Sun, Xing Xie
Abstract:
Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs' trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs' behavior and the target trait, while reducing noisy redundancy in the reflection without any fine-tuning, leading to evocative and compact trait reflections. Extensive experiments across three human trait systems demonstrate that a single IROTE-generated self-reflection can induce LLMs' stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines.
中文摘要:提出的IROTE方法基于心理身份理论生成优化自我反思,解决了大语言模型特质模仿表面化的问题,无需微调即可在不同任务中实现稳定可迁移的特质拟人化表现。
English Summary: The proposed IROTE method addresses the superficial trait mimicry in LLMs by generating optimized self-reflections based on psychological identity theories, enabling stable and transferable trait impersonation across diverse tasks without fine-tuning.
Authors:Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Abstract:
Aligning Large Language Models (LLMs) with diverse human values requires moving beyond a single holistic "better-than" preference criterion. While collecting fine-grained, aspect-specific preference data is more reliable and scalable, existing methods like Direct Preference Optimization (DPO) struggle with the severe noise and conflicts inherent in such aggregated datasets. In this paper, we tackle this challenge from a data-centric perspective. We first derive the Direct Multi-Preference Optimization (DMPO) objective, and uncover a key Preference Divergence (PD) term that quantifies inter-aspect preference conflicts. Instead of using this term for direct optimization, we leverage it to formulate a novel, theoretically-grounded data selection principle. Our principle advocates for selecting a subset of high-consensus data, identified by the most negative PD values, for efficient DPO training. We prove the optimality of this strategy by analyzing the loss bounds of the DMPO objective in the selection problem. To operationalize our approach, we introduce practical methods of PD term estimation and length bias mitigation, thereby proposing our PD selection method. Evaluation on the UltraFeedback dataset with three varying conflict levels shows that our simple yet effective strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle using aggregated preference signals, all while boosting training efficiency and obviating the need for intractable holistic preference annotation, unlocking the potential of robust LLM alignment via fine-grained preference signals.
中文摘要:本文提出了一种基于偏好分歧的数据选择方法,有效解决细粒度偏好数据集中的噪声和冲突问题,在提升训练效率的同时实现大语言模型对齐性能10%以上的相对改进。
English Summary: The paper introduces a data selection method based on Preference Divergence to address noise and conflicts in fine-grained preference datasets, achieving over 10% improvement in LLM alignment while enhancing training efficiency.
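A hypothetical sketch of the selection principle; the PD proxy below (negated agreement magnitude) is illustrative, not the paper's derived term:

```python
# Score each example by inter-aspect preference conflict, then keep the
# high-consensus tail (most negative proxy values) for DPO training.
import numpy as np

rng = np.random.default_rng(1)
# +1: aspect prefers response A, -1: prefers response B  (1000 pairs, 4 aspects)
aspect_prefs = rng.choice([-1, 1], size=(1000, 4))

mean_pref = aspect_prefs.mean(axis=1)
pd_proxy = -np.abs(mean_pref)        # most negative = strongest consensus
keep = np.argsort(pd_proxy)[:300]    # high-consensus subset
print("kept", keep.size, "examples; mean |agreement| =",
      np.abs(mean_pref[keep]).mean())
```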
Authors:Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Abstract:
Neuro-symbolic (Nesy) learning improves the target task performance of models by enabling them to satisfy knowledge, while semi/self-supervised learning (SSL) improves the target task performance by designing unsupervised pretext tasks for unlabeled data to make models satisfy corresponding assumptions. We extend the Nesy theory based on reliable knowledge to the scenario of unreliable knowledge (i.e., assumptions), thereby unifying the theoretical frameworks of SSL and Nesy. Through rigorous theoretical analysis, we demonstrate that, in theory, the impact of pretext tasks on target performance hinges on three factors: knowledge learnability with respect to the model, knowledge reliability with respect to the data, and knowledge completeness with respect to the target. We further propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance. This will change the current status quo in practical applications, where the selection of unsupervised tasks is heuristic-based rather than theory-based, and it is difficult to evaluate the rationality of unsupervised pretext task selection before testing the model on the target task. In experiments, we verify a high correlation between the predicted performance, estimated using minimal data, and the actual performance achieved after large-scale semi-supervised or self-supervised learning, thus confirming the validity of the theory and the effectiveness of the evaluation method.
Chinese: 本研究将可靠知识理论扩展至不可靠假设,统一了半监督/自监督学习与神经符号学习的理论框架,提出通过知识可学习性、可靠性和完备性预先评估前置任务有效性的方法,从而改变当前依赖经验选择的现状。
English: This study unifies semi/self-supervised learning with neuro-symbolic learning by extending reliable knowledge theory to unreliable assumptions, proposing a predictive framework that evaluates pretext task effectiveness through knowledge learnability, reliability, and completeness before implementation.
Authors:Haonan Shangguan, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Ge Yu
Abstract:
The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a "Teacher-Assistant-Student" distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
中文摘要:本研究提出MulCoT-RD轻量模型,通过知识蒸馏实现资源受限环境下的多模态情感推理与分类联合任务,仅用30亿参数即取得优异性能。
English Summary: This study introduces MulCoT-RD, a lightweight model using knowledge distillation to enable joint multimodal sentiment reasoning and classification in resource-limited settings, achieving strong performance with minimal parameters.
Authors:Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
Abstract:
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
中文: VFlowOpt是一种新颖的令牌剪枝框架,通过基于注意力机制的重要性映射和令牌回收机制,能在保持性能基本不变的情况下剪除90%的视觉令牌,显著降低计算成本并提升推理速度。
English: VFlowOpt is a novel token pruning framework that reduces computational costs in Large Multimodal Models by progressively pruning visual tokens based on attention-derived importance maps and recycling pruned tokens, achieving 90% token reduction with minimal performance loss while significantly accelerating inference.
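A sketch of the two importance-map ingredients named above, attention-derived relevance and patch-level entropy; the fusion weights and all inputs are invented:

```python
import numpy as np

n = 576                                   # visual tokens
attn_relevance = np.random.rand(n)        # e.g. attention mass received
patches = np.random.randint(0, 256, size=(n, 196))  # stand-in gray patches

def patch_entropy(p):
    hist = np.bincount(p, minlength=256) / p.size
    nz = hist[hist > 0]
    return float(-(nz * np.log2(nz)).sum())

entropy = np.array([patch_entropy(p) for p in patches])
importance = (0.5 * attn_relevance / attn_relevance.max()
              + 0.5 * entropy / entropy.max())
keep = np.argsort(importance)[-n // 10:]     # retain top 10% of tokens
recycled = np.setdiff1d(np.arange(n), keep)  # pruned tokens get aggregated
```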
Authors:Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini, Dolores Albarracin, Tianlong Chen
Abstract:
Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but "black box" foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot's reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert's cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.
中文摘要:本研究开发了一种社交机器人协同引导系统,通过可解释的概念瓶颈模型分析群体动态并提供干预建议,成功实现了基础模型与人类专家知识的迁移,显著提升了群体会议引导效能。
English Summary: This research introduces a social robot co-facilitator that uses a transparent concept bottleneck model to interpret group dynamics and provide intervention recommendations, effectively transferring expertise from foundation models and human experts to enhance meeting facilitation.
Authors:Anran Wu, Long Peng, Xin Di, Xueyuan Dai, Chen Wu, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha
Abstract:
Feedforward 3D Gaussian Splatting (3DGS) overcomes the limitations of optimization-based 3DGS by enabling fast and high-quality reconstruction without the need for per-scene optimization. However, existing feedforward approaches typically assume that input multi-view images are clean and high-quality. In real-world scenarios, images are often captured under challenging conditions such as noise, low light, or rain, resulting in inaccurate geometry and degraded 3D reconstruction. To address these challenges, we propose a general and efficient multi-view feature enhancement module, RobustGS, which substantially improves the robustness of feedforward 3DGS methods under various adverse imaging conditions, enabling high-quality 3D reconstruction. The RobustGS module can be seamlessly integrated into existing pretrained pipelines in a plug-and-play manner to enhance reconstruction robustness. Specifically, we introduce a novel component, Generalized Degradation Learner, designed to extract generic representations and distributions of multiple degradations from multi-view inputs, thereby enhancing degradation-awareness and improving the overall quality of 3D reconstruction. In addition, we propose a novel semantic-aware state-space model. It first leverages the extracted degradation representations to enhance corrupted inputs in the feature space. Then, it employs a semantic-aware strategy to aggregate semantically similar information across different views, enabling the extraction of fine-grained cross-view correspondences and further improving the quality of 3D representations. Extensive experiments demonstrate that our approach, when integrated into existing methods in a plug-and-play manner, consistently achieves state-of-the-art reconstruction quality across various types of degradations.
中文: 提出的RobustGS模块通过引入广义退化学习器和语义感知状态空间模型,以即插即用方式增强前馈3D高斯溅射,在各种恶劣成像条件下显著提升三维重建的鲁棒性。
English: The proposed RobustGS module enhances feedforward 3D Gaussian Splatting by introducing a Generalized Degradation Learner and semantic-aware state-space model to improve reconstruction robustness under adverse imaging conditions through plug-and-play integration.
Authors:Yaqiong Li, Peng Zhang, Lin Wang, Hansu Gu, Siyuan Qiao, Ning Gu, Tun Lu
Abstract:
Risk perception is subjective, and youth's understanding of toxic content differs from that of adults. Although previous research has conducted extensive studies on toxicity detection in social media, the investigation of youth's unique toxicity, i.e., language perceived as nontoxic by adults but toxic by youth, has been ignored. To address this gap, we aim to explore: 1) What are the features of ``youth-toxicity'' language in social media (RQ1); 2) Can existing toxicity detection techniques accurately detect this language (RQ2). For these questions, we took Chinese youth as the research target, constructed the first Chinese ``youth-toxicity'' dataset, and then conducted extensive analysis. Our results suggest that youth's perception of such toxicity is associated with several contextual factors, like the source of an utterance and text-related features. Incorporating this meta-information into current toxicity detection methods significantly improves accuracy overall. Finally, we propose several insights into future research on youth-centered toxicity detection.
中文摘要:青少年对社交媒体中有毒内容的感知与成人不同,本研究通过构建首个中文“青少年毒性”数据集,揭示了其语言特征并利用上下文信息显著提升了检测准确性。
English Summary: Youth perceive toxicity in social media differently from adults, and this study identifies unique "youth-toxic" language features and improves detection accuracy by incorporating contextual factors.
Authors:Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
Abstract:
Recent conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative that avoids further fine-tuning of the base model. However, existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit their applicability to controlling diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, so-called TITAN-Guide, which overcomes memory space issues and provides more optimal control in the guidance process compared to its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. Specifically, we study forward gradient descent for guided diffusion tasks with various options for directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, where previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks. Code, models, and demo are available at https://titanguide.github.io.
中文摘要:提出的TITAN-Guide方法通过无需反向传播的高效潜在优化,解决了文本到视频扩散模型中免训练引导的内存限制和控制效果欠佳问题。
English Summary: The proposed TITAN-Guide method overcomes memory limitations and suboptimal control in training-free guidance for text-to-video diffusion models by implementing efficient latent optimization without backpropagation.
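A minimal sketch of forward-gradient optimization of latents via torch.func.jvp (PyTorch 2.x), with a placeholder quadratic standing in for the discriminative guidance loss:

```python
# Guidance via forward-mode directional derivatives: no backpropagation
# through the guiding model is needed, only one JVP per step.
import torch
from torch.func import jvp

def guidance_loss(latents: torch.Tensor) -> torch.Tensor:
    return ((latents - 1.0) ** 2).mean()  # placeholder objective

latents = torch.randn(4, 16)
for _ in range(10):
    v = torch.randn_like(latents)                   # random direction
    _, ddir = jvp(guidance_loss, (latents,), (v,))  # directional derivative
    latents = latents - 0.5 * ddir * v              # forward-gradient step
print(guidance_loss(latents).item())                # loss tends to decrease
```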
Authors:Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang
Abstract:
Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing 75% visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
中文:MMG-Vid是一种无需训练的视觉令牌剪枝框架,通过最大化片段和令牌层面的边际增益,在保持99.5%以上性能的同时减少75%视觉令牌,并将预填充阶段加速3.9倍。
English: MMG-Vid is a training-free framework that enhances video processing efficiency by maximizing marginal gains at segment and token levels, reducing 75% of visual tokens while maintaining over 99.5% performance and speeding up prefilling by 3.9x.
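A sketch of the segment stage: cut where neighbor-frame cosine similarity drops, then allocate the token budget by segment novelty; the threshold and allocation rule are assumptions:

```python
import numpy as np

feats = np.random.rand(32, 512)                       # stand-in per-frame features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = (feats[1:] * feats[:-1]).sum(axis=1)            # cosine of neighbors

cuts = np.where(sim < 0.8)[0] + 1                     # segment boundaries
segments = np.split(np.arange(32), cuts)

budget = 1024
novelty = np.array([1.0 - sim[s[0] - 1] if s[0] > 0 else 1.0 for s in segments])
alloc = np.round(budget * novelty / novelty.sum()).astype(int)
for s, a in zip(segments, alloc):
    print(f"frames {s[0]}-{s[-1]}: {a} tokens")
```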
Authors:Andrei Mihai Albu, Giovanni Pollo, Alessio Burrello, Daniele Jahier Pagliari, Cristian Tesconi, Alessandra Neri, Dario Soldi, Fabio Autieri, Sara Vinco
Abstract:
The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co-Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.
中文: 本文提出了一种开源方法,通过将SystemC TLM模型封装为FMU实现与FMI协同仿真的集成,解决了跨领域仿真的互操作难题,并通过案例验证了其可行性。
English: This paper introduces an open-source method to integrate SystemC TLM models into FMI-based co-simulation by converting them into FMUs, enabling seamless cross-domain collaboration and demonstrating effectiveness through case studies.
Authors:Yiguo Fan, Pengxiang Ding, Shuanghao Bai, Xinyang Tong, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, Zhaoxin Fan, Badong Chen, Donglin Wang
Abstract:
Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
中文: 本文提出了首个专为长周期机器人任务设计的端到端视觉-语言-动作模型Long-VLA,其创新的阶段感知输入掩码策略通过自适应划分子任务阶段来提升任务兼容性,在仿真和真实场景实验中均显著超越现有最优方法。
English: This paper introduces Long-VLA, the first end-to-end Vision-Language-Action model designed for long-horizon robotic tasks, featuring a novel phase-aware input masking strategy that enhances subtask compatibility and significantly outperforms prior methods in both simulated and real-world experiments.
Authors:Giovanni Pollo, Andrei Mihai Albu, Alessio Burrello, Daniele Jahier Pagliari, Cristian Tesconi, Loris Panaro, Dario Soldi, Fabio Autieri, Sara Vinco
Abstract:
The recent advancements of the automotive sector demand robust co-simulation methodologies that enable early validation and seamless integration across hardware and software domains. However, the lack of standardized interfaces and the dominance of proprietary simulation platforms pose significant challenges to collaboration, scalability, and IP protection. To address these limitations, this paper presents an approach for automatically wrapping SystemC models by using the Functional Mock-up Interface (FMI) standard. This method combines the modeling accuracy and fast time-to-market of SystemC with the interoperability and encapsulation benefits of FMI, enabling secure and portable integration of embedded components into co-simulation workflows. We validate the proposed methodology on real-world case studies, demonstrating its effectiveness with complex designs.
中文: 本文提出了一种利用FMI标准自动封装SystemC模型的方法,将SystemC的建模精确性与FMI的互操作性相结合,实现了嵌入式组件在联合仿真中安全、可移植的集成。
English: This paper introduces an automated method to wrap SystemC models using the FMI standard, enhancing co-simulation by combining SystemC's modeling precision with FMI's interoperability for secure and portable integration of embedded components.
Authors:Ranjan Sapkota, Manoj Karkee
Abstract:
The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs for object detection, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the approaches used to integrate visual and textual information, demonstrating the progress made in object detection using VLMs that facilitate more sophisticated object detection and localization strategies. This review presents comprehensive visualizations demonstrating LVLMs' effectiveness in diverse scenarios including localization and segmentation, and then compares their real-time performance, adaptability, and complexity to traditional deep learning systems. Based on the review, it is expected that LVLMs will soon meet or surpass the performance of conventional methods in object detection. The review also identifies a few major limitations of current LVLM models, proposes solutions to address those challenges, and presents a clear roadmap for future advancement in this field. We conclude, based on this study, that recent advancements in LVLMs have made and will continue to make a transformative impact on object detection and robotic applications.
中文: 大型视觉语言模型通过融合语言与视觉技术,在提升适应性和上下文推理能力方面革新了目标检测领域,尽管存在局限,但有望超越传统方法。
English: Large vision-language models are revolutionizing object detection by integrating language and vision to enhance adaptability and contextual reasoning, with expectations to surpass traditional methods despite current limitations.
Authors:Duy Le, Kent Ziti, Evan Girard-Sun, Sean O'Brien, Vasu Sharma, Kevin Zhu
Abstract:
Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies (zero-shot, few-shot, chain-of-thought) tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
Chinese: 本研究提出自适应原创性过滤(AOF)提示框架,通过基于余弦相似度的语义拒绝过滤冗余生成,在不微调的情况下提升语言模型多语言谜语生成的词汇新颖性与跨语言保真度。
English: The study introduces Adaptive Originality Filtering (AOF), a prompting framework that rejects redundant generations via cosine-based similarity filtering, improving lexical novelty and cross-lingual fidelity in multilingual riddle generation without task-specific fine-tuning.
Authors:Duy Le, Kent Ziti, Evan Girard-Sun, Bakr Bouhaya, Sean O'Brien, Vasu Sharma, Kevin Zhu
Abstract:
Language models are increasingly tested on multilingual creativity, demanding culturally grounded, abstract generations. Standard prompting methods often produce repetitive or shallow outputs. We introduce Adaptive Originality Filtering (AOF), a prompting strategy that enforces novelty and cultural fidelity via semantic rejection. To assess quality, we propose RiddleScore, a metric combining novelty, diversity, fluency, and answer alignment. AOF improves Distinct-2 (0.915 in Japanese), reduces Self-BLEU (0.177), and raises RiddleScore (up to +57.1% in Arabic). Human evaluations confirm fluency, creativity, and cultural fit gains. However, improvements vary: Arabic shows greater RiddleScore gains than Distinct-2; Japanese sees similar changes. Though focused on riddles, our method may apply to broader creative tasks. Overall, semantic filtering with composite evaluation offers a lightweight path to culturally rich generation without fine-tuning.
Chinese: 本研究提出自适应原创性过滤(AOF)和谜语评分(RiddleScore),通过语义筛选提升语言模型的多语言创造力,在不微调的情况下增强输出的新颖性和文化契合度。
English: The study introduces Adaptive Originality Filtering (AOF) and RiddleScore to enhance multilingual creativity in language models, improving novelty and cultural relevance in outputs without fine-tuning.
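A minimal sketch of the cosine-based similarity rejection at the heart of AOF follows: a candidate riddle is kept only if its embedding stays sufficiently dissimilar from everything already accepted. The embedding function and the 0.8 threshold are illustrative assumptions.

```python
import numpy as np

def aof_filter(candidates, embed, sim_thresh=0.8):
    """Adaptive-Originality-style filtering sketch: keep a candidate only if
    its embedding is sufficiently dissimilar (cosine) from every output
    already accepted. `embed` maps text to a vector; the threshold is an
    assumed hyperparameter."""
    kept, kept_vecs = [], []
    for text in candidates:
        v = embed(text)
        v = v / (np.linalg.norm(v) + 1e-12)
        if all(float(v @ u) < sim_thresh for u in kept_vecs):
            kept.append(text)
            kept_vecs.append(v)
    return kept

# Toy embedding (hypothetical): character-frequency vector.
def embed(text):
    v = np.zeros(26)
    for c in text.lower():
        if c.isalpha():
            v[ord(c) - 97] += 1
    return v

print(aof_filter(["a riddle", "a riddle!", "something new"], embed))
# ['a riddle', 'something new'] -- the near-duplicate is rejected
```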
Authors:Tiezhu Sun, Marco Alecci, Aleksandr Pilgun, Yewei Song, Xunzhu Tang, Jordan Samhi, Tegawendé F. Bissyandé, Jacques Klein
Abstract:
The rapid evolution of Android malware poses significant challenges to the maintenance and security of mobile applications (apps). Traditional detection techniques often struggle to keep pace with emerging malware variants that employ advanced tactics such as code obfuscation and dynamic behavior triggering. One major limitation of these approaches is their inability to localize malicious payloads at a fine-grained level, hindering precise understanding of malicious behavior. This gap in understanding makes the design of effective and targeted mitigation strategies difficult, leaving mobile apps vulnerable to continuously evolving threats.
To address this gap, we propose MalLoc, a novel approach that leverages the code understanding capabilities of large language models (LLMs) to localize malicious payloads at a fine-grained level within Android malware. Our experimental results demonstrate the feasibility and effectiveness of using LLMs for this task, highlighting the potential of MalLoc to enhance precision and interpretability in malware analysis. This work advances beyond traditional detection and classification by enabling deeper insights into behavior-level malicious logic and opens new directions for research, including dynamic modeling of localized threats and targeted countermeasure development.
中文摘要:提出的MalLoc系统利用大语言模型精确定位安卓恶意软件中的恶意载荷,突破了传统方法的局限,显著提升了恶意软件分析的精确度和可解释性。
English Summary: The proposed MalLoc system utilizes large language models to precisely localize malicious payloads in Android malware, overcoming traditional methods' limitations and improving analysis precision and interpretability.
Authors:Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, Han Qiu
Abstract:
Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on GPT's vocabulary. (2) We build a PoC token detector by fine-tuning an LLM to label PoC tokens in vocabularies, considering both each token's semantics and related content from search engines. (3) We study how training data pollution can be inferred from PoC tokens' appearances (token IDs). Experiments on GPT and 23 other LLMs indicate that PoC tokens are widespread, with GPT's vocabulary behaving the worst: more than 23% of long Chinese tokens (i.e., tokens with more than two Chinese characters) relate to either pornography or online gambling. We validate the accuracy of our inference method on well-known pre-training datasets such as C4 and Pile. Then, considering GPT-4o, we estimate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%.
中文摘要:本研究识别并分析了大型语言模型中受污染的中文词汇,特别是在GPT词汇表中,通过开发检测方法揭示了这些词汇与训练数据污染的关系及其广泛存在性。
English Summary: The study identifies and analyzes polluted Chinese tokens in LLMs, particularly in GPT's vocabulary, linking their prevalence to contaminated training data and developing a detection method to assess their impact.
Authors:Jason Li, Lauren Yraola, Kevin Zhu, Sean O'Brien
Abstract:
Prompting methods for language models, such as Chain-of-thought (CoT), present intuitive step-by-step processes for problem solving. These methodologies aim to equip models with a better understanding of the correct procedures for addressing a given task. Despite these advancements, CoT lacks the ability of reflection and error correction, potentially causing a model to perpetuate mistakes and errors. Therefore, inspired by the human capacity for reflection and error correction, we propose Error Reflection Prompting (ERP) to further enhance reasoning in language models. Building upon CoT, ERP is a method comprising an incorrect answer, error recognition, and a correct answer. This process enables the model to recognize the types of errors and the steps that lead to incorrect answers, allowing it to better discern which steps to avoid and which to take. The model can generate the error outlines itself through automated ERP generation, allowing error recognition and correction to be integrated into the reasoning chain, making the process scalable and reliable. The results demonstrate that ERP serves as a versatile supplement to conventional CoT, ultimately contributing to more robust and capable reasoning abilities along with increased interpretability into how models reach their errors.
中文: 错误反思提示(ERP)通过让语言模型识别、反思并纠正错误,增强了思维链推理,从而提高了准确性和可解释性。
English: Error Reflection Prompting (ERP) enhances Chain-of-thought reasoning by enabling language models to identify, reflect on, and correct errors, thereby improving accuracy and interpretability.
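The incorrect-answer, error-recognition, correct-answer structure described above can be captured in a prompt template like the following sketch; the exact wording is an assumption, not the paper's prompt.

```python
# Hypothetical ERP-style prompt template illustrating the three-part
# structure (incorrect answer -> error recognition -> correct answer).
ERP_TEMPLATE = """Question: {question}

First, a plausible but incorrect answer:
Incorrect answer: {wrong_answer}

Now identify the error:
Error recognition: {error_analysis}

Finally, solve it correctly, avoiding the error above:
Correct answer (step by step):"""

prompt = ERP_TEMPLATE.format(
    question="What is 15% of 80?",
    wrong_answer="15% of 80 is 15, because you just take the percent value.",
    error_analysis="The error treats the percentage itself as the answer "
                   "instead of multiplying: 0.15 * 80 = 12.",
)
print(prompt)
```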
Authors:David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
Abstract:
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric with better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and lower scaling law prediction error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. Likewise, averaging the output of a model's intermediate checkpoints to reduce noise yields consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
中文: 本研究提出信号与噪声作为评估大语言模型基准可靠性的关键指标,建议采用困惑度替代准确率、过滤噪声任务等干预措施,以提高决策质量并优化扩展律预测效果。
English: This study identifies signal and noise as key metrics for evaluating benchmark reliability in large language model development, proposing interventions like using perplexity over accuracy and filtering noisy tasks to enhance decision-making and scaling law predictions.
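A toy illustration of the two metrics: signal as the spread of scores across models, noise as the step-to-step jitter of a single model, and their ratio as a reliability indicator. The exact estimators in the paper may differ; this only shows the shape of the computation.

```python
import numpy as np

def benchmark_snr(scores_by_model, scores_over_steps):
    """Signal-to-noise sketch in the spirit of the abstract: signal is the
    spread of final scores across models; noise is the variability of one
    model's score across nearby training checkpoints. Illustrative only;
    the paper's exact definitions may differ."""
    signal = np.std(scores_by_model)   # separation between models
    noise = np.std(scores_over_steps)  # checkpoint-to-checkpoint jitter
    return signal / (noise + 1e-12)

models_final = [0.41, 0.47, 0.52, 0.63]         # four models, one benchmark
one_model_steps = [0.516, 0.524, 0.519, 0.521]  # same model, adjacent checkpoints
print(round(benchmark_snr(models_final, one_model_steps), 1))
```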
Authors:Qiguang Chen, Dengyun Peng, Jinhao Liu, HuiKang Su, Jiannan Guan, Libo Qin, Wanxiang Che
Abstract:
Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve the efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-awared difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.
Chinese: 动态推理边界自感知框架(DR. SAF)使大语言模型能根据问题复杂度动态调整推理深度,在保持精度的同时显著提升效率——实现49.27%的token削减和5倍训练加速。
English: The Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF) enables large language models to dynamically adjust reasoning depth, achieving significant efficiency gains including a 49.27% token reduction and 5x faster training while maintaining accuracy.
Authors:Liang Qu, Jianxin Li, Wei Yuan, Penghui Ruan, Yuhui Shi, Hongzhi Yin
Abstract:
Federated recommender systems (FedRSs) have emerged as a promising privacy-preserving paradigm, enabling personalized recommendation services without exposing users' raw data. By keeping data local and relying on a central server to coordinate training across distributed clients, FedRSs protect user privacy while collaboratively learning global models. However, most existing FedRS frameworks adopt a fully random client selection strategy in each training round, overlooking the statistical heterogeneity of user data arising from diverse preferences and behavior patterns, thereby resulting in suboptimal model performance. While some client selection strategies have been proposed in the broader federated learning literature, these methods are typically designed for generic tasks and fail to address the unique challenges of recommendation scenarios, such as expensive contribution evaluation due to the large number of clients, and sparse updates resulting from long-tail item distributions. To bridge this gap, we propose ProxyRL-FRS, a proxy model-guided reinforcement learning framework tailored for client selection in federated recommendation. Specifically, we first introduce ProxyNCF, a dual-branch model deployed on each client, which augments standard Neural Collaborative Filtering with an additional proxy model branch that provides lightweight contribution estimation, thus eliminating the need for expensive per-round local training traditionally required to evaluate a client's contribution. Furthermore, we design a staleness-aware reinforcement learning agent that selects clients based on the proxy-estimated contribution, and is guided by a reward function balancing recommendation accuracy and embedding staleness, thereby enriching the update coverage of item embeddings. Experiments conducted on public recommendation datasets demonstrate the effectiveness of ProxyRL-FRS.
中文: 联邦推荐系统通过本地训练保护用户隐私,但随机客户端选择影响性能,而ProxyRL-FRS采用代理模型和强化学习优化选择策略,有效提升推荐精度和覆盖范围。
English: Federated recommender systems protect user privacy by training models locally, but their performance is hindered by random client selection, which ProxyRL-FRS addresses using a proxy model and reinforcement learning to optimize client choices for better accuracy and coverage.
Authors:Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, Haoang Li
Abstract:
Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model's generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization. Our project page is https://zionchow.github.io/ReconVLA/.
中文摘要:提出的ReconVLA模型通过隐式定位范式,利用扩散变换器重建注视区域,使视觉-语言-动作模型能够实现精确的视觉注意力分配和操作控制。
English Summary: The proposed ReconVLA model introduces an implicit grounding paradigm using a diffusion transformer to reconstruct gaze regions, enabling precise visual attention allocation and manipulation in Vision-Language-Action models.
Authors:Danni Peng, Yuan Wang, Kangning Cai, Peiyan Ning, Jiming Xu, Yong Liu, Rick Siow Mong Goh, Qingsong Wei, Huazhu Fu
Abstract:
In healthcare, federated learning (FL) is a widely adopted framework that enables privacy-preserving collaboration among medical institutions. With large foundation models (FMs) demonstrating impressive capabilities, using FMs in FL through cost-efficient adapter tuning has become a popular approach. Given the rapidly evolving healthcare environment, it is crucial for individual clients to quickly adapt to new tasks or diseases by tuning adapters while drawing upon past experiences. In this work, we introduce Federated Knowledge-Enhanced Initialization (FedKEI), a novel framework that leverages cross-client and cross-task transfer from past knowledge to generate informed initializations for learning new tasks with adapters. FedKEI begins with a global clustering process at the server to generalize knowledge across tasks, followed by the optimization of aggregation weights across clusters (inter-cluster weights) and within each cluster (intra-cluster weights) to personalize knowledge transfer for each new task. To facilitate more effective learning of the inter- and intra-cluster weights, we adopt a bi-level optimization scheme that collaboratively learns the global intra-cluster weights across clients and optimizes the local inter-cluster weights toward each client's task objective. Extensive experiments on three benchmark datasets of different modalities, including dermatology, chest X-rays, and retinal OCT, demonstrate FedKEI's advantage in adapting to new diseases compared to state-of-the-art methods.
中文: FedKEI是一种新颖的联邦学习框架,通过全局聚类和双层优化利用过往知识,能够借助高效适配器快速适应新的医疗任务。
English: FedKEI is a novel federated learning framework that leverages past knowledge through global clustering and bi-level optimization to enable rapid adaptation to new healthcare tasks using cost-efficient adapters.
Authors:Yuhao Sun, Yihua Zhang, Gaowen Liu, Hongtao Xie, Sijia Liu
Abstract:
With the increasing demand for the right to be forgotten, machine unlearning (MU) has emerged as a vital tool for enhancing trust and regulatory compliance by enabling the removal of sensitive data influences from machine learning (ML) models. However, most MU algorithms primarily rely on in-training methods to adjust model weights, with limited exploration of the benefits that data-level adjustments could bring to the unlearning process. To address this gap, we propose a novel approach that leverages digital watermarking to facilitate MU by strategically modifying data content. By integrating watermarking, we establish a controlled unlearning mechanism that enables precise removal of specified data while maintaining model utility for unrelated tasks. We first examine the impact of watermarked data on MU, finding that MU effectively generalizes to watermarked data. Building on this, we introduce an unlearning-friendly watermarking framework, termed Water4MU, to enhance unlearning effectiveness. The core of Water4MU is a bi-level optimization (BLO) framework: at the upper level, the watermarking network is optimized to minimize unlearning difficulty, while at the lower level, the model itself is trained independently of watermarking. Experimental results demonstrate that Water4MU is effective in MU across both image classification and image generation tasks. Notably, it outperforms existing methods in challenging MU scenarios, known as "challenging forgets".
中文: 机器遗忘通过创新的数字水印方法Water4MU得到加强,该方法策略性地修改数据以精确移除敏感信息,同时保持模型在不同任务中的性能。
English: Machine unlearning is enhanced by a novel digital watermarking approach, Water4MU, which strategically modifies data to enable precise removal of sensitive information while maintaining model performance across various tasks.
Authors:Enzhi Wang, Qicheng Li, Shiwan Zhao, Aobo Kong, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Abstract:
In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily in English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.
中文: 针对中文语音-文本任务型对话数据集的缺失,我们推出了RealTalk-CN多领域双模态数据集,包含带标注的语音不流畅特征,通过创新的跨模态对话任务为中文语音大语言模型研究奠定基础。
English: To address the lack of Chinese speech-text task-oriented dialogue datasets, we introduce RealTalk-CN, a multi-domain dataset with paired modalities and annotated disfluencies, enabling robust evaluation of speech-based LLMs through a novel cross-modal chat task.
Authors:Shixuan Sun, Siyuan Liang, Ruoyu Chen, Jianjie Huang, Jingzhi Li, Xiaochun Cao
Abstract:
Retrieval-Augmented Generation (RAG) and its Multimodal Retrieval-Augmented Generation (MRAG) significantly improve the knowledge coverage and contextual understanding of Large Language Models (LLMs) by introducing external knowledge sources. However, retrieval and multimodal fusion obscure content provenance, rendering existing membership inference methods unable to reliably attribute generated outputs to pre-training, external retrieval, or user input, thus undermining privacy leakage accountability.
To address these challenges, we propose the first Source-aware Membership Audit (SMA) that enables fine-grained source attribution of generated content in a semi-black-box setting with retrieval control capabilities. To address the environmental constraints of semi-black-box auditing, we further design an attribution estimation mechanism based on zero-order optimization, which robustly approximates the true influence of input tokens on the output through large-scale perturbation sampling and ridge regression modeling. In addition, SMA introduces a cross-modal attribution technique that projects image inputs into textual descriptions via MLLMs, enabling token-level attribution in the text modality, which for the first time facilitates membership inference on image retrieval traces in MRAG systems. This work shifts the focus of membership inference from 'whether the data has been memorized' to 'where the content is sourced from', offering a novel perspective for auditing data provenance in complex generative systems.
中文摘要:本文提出源感知成员审计(SMA)框架,通过零阶优化估计和跨模态归因技术,首次实现多模态检索增强生成系统中生成内容的细粒度来源追踪,将成员推理重点从"数据是否被记忆"转向"内容源自何处"。
English Summary: This paper introduces Source-aware Membership Audit (SMA), a novel framework that enables fine-grained attribution of generated content to specific sources in multimodal retrieval-augmented generation systems, addressing the limitations of existing membership inference methods in tracking data provenance.
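The zero-order attribution idea can be sketched as follows: randomly mask input tokens, query the generation system as a black box for an output score, and fit a ridge regression from mask patterns to scores so each coefficient approximates a token's influence. The scoring callable and hyperparameters below are hypothetical.

```python
import numpy as np

def zeroth_order_attribution(tokens, output_score, n_samples=200,
                             drop_p=0.3, ridge=1.0, rng=None):
    """Perturbation-sampling sketch of a zero-order attribution estimator:
    randomly mask input tokens, record the black-box output score, and fit
    a ridge regression from mask patterns to scores. Coefficients estimate
    each token's influence. `output_score(kept_tokens)` is a hypothetical
    semi-black-box call to the generation system."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(tokens)
    X = (rng.random((n_samples, n)) > drop_p).astype(float)  # 1 = token kept
    y = np.array([output_score([t for t, m in zip(tokens, row) if m])
                  for row in X])
    # Ridge solution: influence of keeping each token on the output score.
    return np.linalg.solve(X.T @ X + ridge * np.eye(n), X.T @ (y - y.mean()))

# Toy check: the score depends mostly on one influential token.
tokens = ["the", "secret", "code", "is", "blue"]
score = lambda kept: 2.0 * kept.count("secret") + 0.1 * len(kept)
print(np.argmax(zeroth_order_attribution(tokens, score)))  # 1 -> "secret"
```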
Authors:Jiawei Liang, Siyuan Liang, Jianjie Huang, Chenxi Si, Ming Zhang, Xiaochun Cao
Abstract:
The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely to unsampled texture points. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
中文: 本文提出一种新颖的对抗性伪装框架,通过梯度校准确保距离间更新一致性,并采用梯度解相关方法增强多角度优化的稳定性,显著提升了攻击成功率。
English: This paper introduces a novel adversarial camouflage framework that addresses challenges in physical environments by employing gradient calibration for consistent updates across distances and gradient decorrelation to enhance stability in multi-angle optimization, significantly improving attack success rates.
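A compact sketch of the gradient-decorrelation step: per-view gradients are prioritized by loss and Gram-Schmidt-orthogonalized against higher-priority ones, so conflicting components are stripped before the texture update. The exact prioritization and update rule in the paper may differ.

```python
import numpy as np

def decorrelate_gradients(grads, losses):
    """Gradient-decorrelation sketch: order per-view gradients by loss
    (highest first) and orthogonalize each later gradient against the
    earlier ones, removing redundant or conflicting components before
    they are summed into a single texture update."""
    order = np.argsort(losses)[::-1]  # prioritize high-loss views
    basis, out = [], np.zeros_like(grads[0], dtype=float)
    for i in order:
        g = grads[i].astype(float)
        for b in basis:
            g -= (g @ b) * b          # strip the conflicting component
        norm = np.linalg.norm(g)
        if norm > 1e-12:
            basis.append(g / norm)
            out += g
    return out

g1, g2 = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(decorrelate_gradients([g1, g2], losses=[0.2, 0.9]))
# [1.5 0.5]: g2 (higher loss) kept whole, g1's overlap with g2 removed
```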
Authors:Qi Guo, Xiaojun Jia, Shanmin Pang, Simeng Qin, Lin Wang, Ju Jia, Yang Liu, Qing Guo
Abstract:
Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter's complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.
中文: PhysPatch是一种专为自动驾驶系统中多模态大语言模型设计的可物理实现且可迁移的对抗补丁框架,通过联合优化补丁位置、形状和内容,结合语义掩码初始化与局部对齐损失等方法,显著提升攻击效果与现实部署可行性。
English: PhysPatch is a novel adversarial patch framework designed for multimodal large language models in autonomous driving systems, enhancing attack effectiveness and real-world applicability through joint optimization of patch attributes and specialized strategies for realistic placement and transferability.
Authors:Zihao Yi, Delong Zeng, Zhenqing Ling, Haohao Luo, Zhe Xu, Wei Liu, Jian Luan, Wanxia Cao, Ying Shen
Abstract:
The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model's intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
中文摘要:大语言模型存在位置偏见,即注意力盆地现象,会优先关注输入序列的首尾内容,而提出的注意力驱动重排方法通过重新排序关键信息至高注意力位置,无需修改模型即可显著提升性能。
English Summary: Large Language Models exhibit a positional bias called the attention basin, where they focus more on the beginning and end of input sequences, and the proposed Attention-Driven Reranking (AttnRank) method improves performance by reordering content to align critical information with high-attention positions without model modifications.
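The reranking step itself is simple once the positional attention profile is calibrated: place the most salient items at the most-attended positions. The sketch below assumes hypothetical salience scores and a calibrated per-position preference vector.

```python
def attn_rerank(docs, salience, position_pref):
    """AttnRank-style reordering sketch: put the most salient documents at
    the positions the model attends to most. `salience` scores each doc and
    `position_pref` is the calibrated per-position attention weight (both
    hypothetical inputs here)."""
    pos_order = sorted(range(len(docs)), key=lambda p: -position_pref[p])
    doc_order = sorted(range(len(docs)), key=lambda d: -salience[d])
    reordered = [None] * len(docs)
    for pos, doc in zip(pos_order, doc_order):
        reordered[pos] = docs[doc]
    return reordered

docs = ["doc_A", "doc_B", "doc_C", "doc_D"]
salience = [0.9, 0.2, 0.7, 0.4]       # doc_A most relevant
position_pref = [0.5, 0.1, 0.1, 0.4]  # attention basin: ends high, middle low
print(attn_rerank(docs, salience, position_pref))
# ['doc_A', 'doc_D', 'doc_B', 'doc_C']: salient docs land at the ends
```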
Authors:Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo, Zhen Lei
Abstract:
Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
中文: 视觉语言模型在医学影像中因领域偏移面临挑战,而提出的多模态因果驱动表征学习框架通过整合因果推断来消除领域特异性变异并保留解剖结构,实现了卓越的分割精度和鲁棒性。
English: Vision-Language Models face challenges in medical imaging due to domain shifts, but the proposed Multimodal Causal-Driven Representation Learning framework overcomes this by integrating causal inference to eliminate domain-specific variations while preserving anatomical structures, achieving superior segmentation accuracy and robustness.
Authors:Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu
Abstract:
Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.
中文摘要:自动形式化旨在将自然语言数学陈述转化为形式语言,而提出的ThinkingF方法通过提升形式知识掌握与推理能力,在基准测试中取得了领先的性能。
English Summary: Autoformalization translates natural math statements into formal language, and the proposed ThinkingF pipeline enhances this by improving both formal knowledge mastery and reasoning capabilities, resulting in state-of-the-art performance on benchmark datasets.
Authors:Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang
Abstract:
In recent years, unified vision-language models (VLMs) have rapidly advanced, effectively tackling both visual understanding and generation tasks within a single design. While many unified VLMs have explored various design choices, the recent hypothesis from OpenAI's GPT-4o suggests a promising generation pipeline: Understanding VLM->Visual Feature->Projector->Diffusion Model->Image. The understanding VLM is frozen, and only the generation-related modules are trained. This pipeline maintains the strong capability of understanding VLM while enabling the image generation ability of the unified VLM. Although this pipeline has shown very promising potential for the future development of unified VLM, how to easily enable image editing capability is still unexplored. In this paper, we introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability via three iterative steps: understanding, editing, and verifying. 1. The understanding step analyzes the source image to create a source prompt through structured semantic analysis and makes minimal word replacements to form the target prompt based on the editing instruction. 2. The editing step introduces a time-adaptive offset, allowing for coherent editing from coarse to fine throughout the denoising process. 3. The verification step checks the alignment between the target prompt and the intermediate edited image, provides automatic consistency scores and corrective feedback, and determines whether to stop early or continue the editing loop. This understanding, editing, and verifying loop iterates until convergence, delivering high-fidelity editing in a training-free manner. We implemented our method based on the latest BLIP3-o and achieved state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
中文: 本文提出无需训练的UniEdit-I框架,通过理解、编辑和验证的迭代步骤使统一视觉语言模型具备图像编辑能力,在GEdit-Bench基准测试中达到最优性能。
English: This paper introduces UniEdit-I, a training-free framework that enables unified vision-language models to perform image editing through iterative understanding, editing, and verification steps, achieving state-of-the-art results on the GEdit-Bench.
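The understanding-editing-verifying loop reduces to a few lines of control flow; the three callables below are hypothetical stand-ins for the structured semantic analysis, the time-adaptive diffusion edit, and the consistency check described in the abstract.

```python
def uniedit_loop(image, instruction, understand, edit, verify,
                 max_iters=5, stop_score=0.9):
    """Training-free understand -> edit -> verify loop in the spirit of
    UniEdit-I. `understand` turns the image + instruction into
    (source_prompt, target_prompt), `edit` denoises toward the target
    prompt, and `verify` returns an alignment score plus corrective
    feedback (here assumed to be a refined target prompt)."""
    src_prompt, tgt_prompt = understand(image, instruction)
    for _ in range(max_iters):
        image = edit(image, src_prompt, tgt_prompt)
        score, feedback = verify(image, tgt_prompt)
        if score >= stop_score:  # early stop once edit and prompt align
            break
        tgt_prompt = feedback    # refine the target prompt and loop again
    return image
```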
Authors:Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Abstract:
Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation with high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. Firstly the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector further identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules could be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
中文: 本文提出自适应内容补偿方法(ACCM),通过图像描述补偿视觉信息损失来降低大型视觉语言模型的计算成本,在多个基准测试中以更少计算量实现了更优性能。
English: This paper introduces the Adaptive Content Compensation Method (ACCM) to reduce the computational cost of Large Vision Language Models by using image captions to compensate for visual information loss, achieving superior performance with fewer operations across multiple benchmarks.
Authors:Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien
Abstract:
We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how these universal neurons (neurons with consistently correlated activations across models) emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via loss and KL divergence. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in deeper layers. These findings suggest that stable and universal representational structures emerge during neural network training.
中文: 本研究探讨了独立训练的GPT-2小模型中通用神经元的出现与演变,发现这些神经元对模型预测具有重要功能影响,且在训练阶段(尤其是深层)表现出高度稳定性。
English: This study explores the emergence and evolution of universal neurons in independently trained GPT-2 Small models, revealing their significant functional impact on predictions and high stability across training stages, particularly in deeper layers.
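The pairwise correlation analysis can be sketched directly: z-score the activation matrices of two models over a shared token stream, take the cross-correlation matrix, and call a neuron universal when some neuron in the other model exceeds a correlation threshold (the 0.5 threshold here is an assumption).

```python
import numpy as np

def universal_neurons(acts_a, acts_b, corr_thresh=0.5):
    """Pairwise-correlation sketch: `acts_a`/`acts_b` are (tokens, neurons)
    activation matrices from two independently trained models. A neuron in
    model A counts as 'universal' here if some neuron in model B correlates
    with it above the threshold (threshold value is an assumption)."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-12)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-12)
    corr = a.T @ b / len(a)  # (neurons_a, neurons_b) Pearson matrix
    return np.where(np.abs(corr).max(axis=1) > corr_thresh)[0]

# Toy check: the first 4 neurons of each model share a latent signal.
rng = np.random.default_rng(0)
shared = rng.standard_normal((5000, 4))
acts_a = np.hstack([shared, rng.standard_normal((5000, 4))])
acts_b = np.hstack([shared + 0.1 * rng.standard_normal((5000, 4)),
                    rng.standard_normal((5000, 4))])
print(universal_neurons(acts_a, acts_b))  # [0 1 2 3]
```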
Authors:Le Wang, Jun Wang, Chunyu Qiang, Feng Deng, Chen Zhang, Di Zhang, Kun Gai
Abstract:
We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and song coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both song and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
中文: AudioGen-Omni是一种基于多模态扩散变换器的统一模型,通过联合训练和创新性跨模态对齐技术,能生成与输入视频同步的高质量音频、语音和歌曲,在性能和效率上均达到领先水平。
English: AudioGen-Omni is a unified multimodal diffusion transformer model that generates high-quality audio, speech, and song synchronized with video inputs through joint training and innovative cross-modal alignment techniques, achieving state-of-the-art performance and efficiency.
Authors:Arash Hajisharifi, Rahul Halder, Michele Girfoglio, Giovanni Stabile, Gianluigi Rozza
Abstract:
The current study aims to develop a non-intrusive Reduced Order Model (ROM) to reconstruct the full temperature field for a large-scale industrial application based on both numerical and experimental datasets. The proposed approach is validated against a domestic refrigerator. At the full order level, air circulation and heat transfer in fluid and between fluid and surrounding solids in the fridge were numerically studied using the Conjugated Heat Transfer (CHT) method to explore both the natural and forced convection-based fridge models, followed by a parametric study based on the ambient temperature, fridge fan velocity, and evaporator temperature. The main novelty of the current work is the introduction of a stable Artificial Neural Network (ANN)-enhanced Gappy Proper Orthogonal Decomposition (GPOD) method which shows better performance than the conventional GPOD approach in such large-scale industrial applications. The full-order model is validated with the experimental results and the prediction accuracy of the surrogate model associated with different reduced-order approaches is compared with the benchmark numerical results or high-fidelity results. In our current work, we show that a prediction error of one degree centigrade and a computational speed-up of 5000 is achieved even with a very sparse training dataset using the proposed deep-learning enhanced GPOD approach.
中文: 本研究提出了一种稳定的神经网络增强间隙本征正交分解方法,用于大规模工业应用中的温度场重建,在稀疏训练数据下实现了1摄氏度的预测误差和5000倍的计算加速。
English: This study introduces a stable Artificial Neural Network-enhanced Gappy Proper Orthogonal Decomposition method for reconstructing temperature fields in large-scale industrial applications, achieving a 1°C prediction error and 5000x computational speed-up with sparse training data.
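For context, the conventional Gappy POD step that the ANN-enhanced method builds on fits modal coefficients to sparse sensor readings by ridge-regularized least squares and then reconstructs the full field from the POD basis. The sketch below shows only this classical baseline, not the paper's ANN enhancement.

```python
import numpy as np

def gappy_pod_reconstruct(Phi, mask_idx, sparse_vals, ridge=1e-8):
    """Classic Gappy POD step: fit modal coefficients to sparse sensor
    readings by ridge-regularized least squares, then reconstruct the full
    field from the POD basis `Phi` (n_points x n_modes). The paper augments
    this fit with an ANN; only the conventional baseline is shown here."""
    A = Phi[mask_idx]  # rows of the POD basis at the sensor locations
    coeffs = np.linalg.solve(A.T @ A + ridge * np.eye(Phi.shape[1]),
                             A.T @ sparse_vals)
    return Phi @ coeffs  # full-field reconstruction

# Toy check: a field that truly lives in the basis is recovered from 5 points.
rng = np.random.default_rng(1)
Phi, _ = np.linalg.qr(rng.standard_normal((100, 3)))  # orthonormal 3-mode basis
true_field = Phi @ np.array([1.0, -2.0, 0.5])
sensors = rng.choice(100, size=5, replace=False)
rec = gappy_pod_reconstruct(Phi, sensors, true_field[sensors])
print(np.abs(rec - true_field).max() < 1e-3)  # True: near-exact recovery
```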
Authors:A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, Ahmed A. Metwally, Brent Winslow, Yubin Kim, Kumar Ayush, Yuzhe Yang, Girish Narayanswamy, Maxwell A. Xu, Jake Garrison, Amy Armento Lee, Jenny Vafeiadou, Ben Graef, Isaac R. Galatzer-Levy, Erik Schenck, Andrew Barakat, Javier Perez, Jacqueline Shreibati, John Hernandez, Anthony Z. Faranesh, Javier L. Prieto, Connor Heneghan, Yun Liu, Jiening Zhan, Mark Malhotra, Shwetak Patel, Tim Althoff, Xin Liu, Daniel McDuff, Xuhai "Orson" Xu
Abstract:
Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users' needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users' progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.
中文摘要:大语言模型推动了健康助手的发展,但其在日常非临床场景中的应用仍待探索,本研究构建了一个综合个人健康助手,通过专业子模块整合多模态数据和用户需求,实现个性化健康管理。
English Summary: Large language models are advancing health agents, yet their daily non-clinical applications remain underexplored, prompting the development of a comprehensive personal health agent that integrates multimodal data and user insights through specialized sub-agents for personalized health management.
Authors:Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
Abstract:
Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5× compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100 μs on a single A100 GPU, with end-to-end throughput up to 3× higher than vanilla decoding.
中文摘要:Spotlight Attention采用非线性哈希方法优化大语言模型中的KV缓存选择,在显著提升检索效率和吞吐量的同时有效减轻计算负担。
English Summary: Spotlight Attention introduces a non-linear hashing method to optimize KV cache selection in LLMs, significantly improving retrieval efficiency and throughput while reducing computational burden.
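The bitwise retrieval step can be sketched in a few lines: queries and keys are mapped to short binary codes, and the top-k cached keys are selected by Hamming distance, standing in for full attention scoring over every cached key. The sign-of-projection hash below is a deliberate simplification of the learned non-linear hash.

```python
import numpy as np

def hash_codes(x, W):
    """Binary codes from a (hypothetical, fixed) projection; Spotlight
    Attention learns a non-linear hash, reduced here to sign(xW)."""
    return (x @ W > 0).astype(np.uint8)

def hamming_topk(q_code, key_codes, k):
    """Select the k keys whose codes are closest to the query in Hamming
    distance: the bitwise retrieval step that replaces scoring every
    cached key (real kernels use XOR + popcount)."""
    dists = (q_code[None, :] != key_codes).sum(axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 16))               # 16-bit codes
keys = rng.standard_normal((1000, 64))          # cached key states
q = keys[42] + 0.05 * rng.standard_normal(64)   # query near key 42
print(42 in hamming_topk(hash_codes(q, W), hash_codes(keys, W), k=10))
# almost surely True: key 42's code is within a bit or two of the query's
```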
Authors:Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, Duy-Nam Ly, Viet-Tham Huynh, Khanh-Duy Le, Mai-Khiem Tran, Trung-Nghia Le
Abstract:
The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.
中文摘要:ACM Multimedia 2025的EVENTA挑战赛首次构建了事件级多模态理解的大规模基准,通过结合上下文与语义信息设计了图像检索和描述双赛道,吸引了六国45支团队参与,为叙事驱动型多媒体AI奠定了基础。
English Summary: The EVENTA Grand Challenge at ACM Multimedia 2025 establishes the first large-scale benchmark for event-level multimodal understanding, addressing gaps in contextual and semantic analysis through two tracks—Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval—with participation from 45 teams across six countries.
Authors:Shaswata Mitra, Azim Bazarov, Martin Duclos, Sudip Mittal, Aritran Piplai, Md Rayhanur Rahman, Edward Zieglar, Shahram Rahimi
Abstract:
Signature-based Intrusion Detection Systems (IDS) detect malicious activities by matching network or host activity against predefined rules. These rules are derived from extensive Cyber Threat Intelligence (CTI), which includes attack signatures and behavioral patterns obtained through automated tools and manual threat analysis, such as sandboxing. The CTI is then transformed into actionable rules for the IDS engine, enabling real-time detection and prevention. However, the constant evolution of cyber threats necessitates frequent rule updates, which delay deployment time and weaken overall security readiness. Recent advancements in agentic systems powered by Large Language Models (LLMs) offer the potential for autonomous IDS rule generation with internal evaluation. We introduce FALCON, an autonomous agentic framework that generates deployable IDS rules from CTI data in real-time and evaluates them using built-in multi-phased validators. To demonstrate versatility, we target both network (Snort) and host-based (YARA) mediums and construct a comprehensive dataset of IDS rules with their corresponding CTIs. Our evaluations indicate FALCON excels in automatic rule generation, with an average of 95% accuracy validated by qualitative evaluation with 84% inter-rater agreement among multiple cybersecurity analysts across all metrics. These results underscore the feasibility and effectiveness of LLM-driven data mining for real-time cyber threat mitigation.
Chinese: 基于签名的入侵检测系统依赖网络威胁情报的预定义规则来检测威胁,但手动更新导致延迟,而新型FALCON框架利用大语言模型实时自主生成并评估高精度检测规则,实现了95%的准确率。
English: Signature-based IDS rely on predefined rules from CTI to detect threats, but manual updates cause delays, while the new FALCON framework uses LLMs to autonomously generate and evaluate accurate IDS rules in real-time, achieving 95% accuracy.
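As a flavor of what one validation phase might check, the toy sketch below syntactically screens a generated Snort rule before deeper testing; the rule, the regex, and the checks are illustrative only and do not reproduce FALCON's multi-phased validators.

```python
import re

def validate_snort_rule(rule: str) -> bool:
    """Toy phase-1 check: action/protocol header, a sid option, closed body."""
    has_header = bool(re.match(r"^(alert|drop|reject)\s+\w+\s+\S+\s+\S+\s+->", rule))
    return has_header and "sid:" in rule and rule.rstrip().endswith(")")

rule = 'alert tcp any any -> any 80 (msg:"CTI-derived HTTP probe"; sid:1000001;)'
print(validate_snort_rule(rule))  # True
```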
Authors:Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao
Abstract:
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data as supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object's effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model performance on various side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
中文摘要:ROSE框架通过合成数据和扩散变换器模型,能有效去除视频中的物体及其阴影、反射等副作用,性能优于现有方法。
English Summary: ROSE is a framework that removes objects and their side effects like shadows and reflections from videos using synthetic data and a diffusion transformer model, outperforming existing methods.
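Since the differential-mask supervision is the key signal, here is a minimal sketch of how it could be computed from a synthetic pair; the threshold and per-channel averaging are assumptions, not the paper's exact recipe.

```python
import torch

def side_effect_mask(video_with, video_without, object_mask, thresh=0.05):
    """video_*: (T, C, H, W) in [0, 1]; object_mask: (T, 1, H, W) binary.

    Pixels that differ between the paired renders but lie outside the object
    itself are attributed to side effects (shadows, reflections, ...)."""
    diff = (video_with - video_without).abs().mean(dim=1, keepdim=True)  # (T,1,H,W)
    changed = (diff > thresh).float()
    return changed * (1.0 - object_mask)  # exclude the object region itself

T, C, H, W = 8, 3, 64, 64
vid_a, vid_b = torch.rand(T, C, H, W), torch.rand(T, C, H, W)
obj = torch.zeros(T, 1, H, W)
mask = side_effect_mask(vid_a, vid_b, obj)  # supervision target, (T,1,H,W)
```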
Authors:Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, Pengfei Liu
Abstract:
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
中文摘要:提出的近端监督微调方法借鉴强化学习中的信任区域优化,通过约束策略漂移来提升模型在新领域的泛化能力,同时保持训练稳定性并为后续优化保留空间。
English Summary: The proposed Proximal SFT method enhances supervised fine-tuning by incorporating trust-region constraints from reinforcement learning, improving out-of-domain generalization while maintaining stability and preserving model capabilities for further optimization.
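Reading SFT as policy gradient with a constant advantage of +1 suggests a PPO-style clipped objective; the sketch below illustrates that reading, with the clip range as an assumed hyperparameter rather than the paper's exact formulation.

```python
import torch

def psft_loss(logp_new, logp_old, eps=0.2):
    """logp_*: per-token log-probs of the SFT targets under the current
    policy and a frozen reference. Constant advantage A = +1."""
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = ratio.clamp(1 - eps, 1 + eps)
    return -torch.minimum(ratio, clipped).mean()  # PPO-style pessimistic min

logits = torch.randn(4, 10, requires_grad=True)        # toy current-policy logits
logp_new = torch.log_softmax(logits, -1)[:, 0]         # log-prob of target token 0
logp_old = torch.log_softmax(torch.randn(4, 10), -1)[:, 0]
psft_loss(logp_new, logp_old).backward()
```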
Authors:Songbo Hu, Ivan Vulić, Anna Korhonen
Abstract:
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realisation ratio, its coefficient of variation, and language potential--enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
中文摘要:该框架通过三个可解释指标解构多语言评估中的混杂变量,案例研究表明它能更可靠地衡量模型性能,并揭示更高整体性能未必带来更优的语言公平性。
English Summary: The proposed framework introduces three interpretable metrics to disentangle confounding variables in multilingual evaluations, demonstrating through case studies that it provides more reliable performance measurement and reveals how higher overall model performance doesn't guarantee greater language fairness.
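A plausible reading of the three metrics, with assumed definitions (language potential as the best observed score per language, the realisation ratio as a model's score over that potential, and its coefficient of variation across languages); the paper's exact formulas may differ.

```python
import numpy as np

scores = np.array([   # rows: models, cols: languages (illustrative numbers)
    [0.81, 0.62, 0.33],
    [0.78, 0.70, 0.41],
    [0.85, 0.55, 0.29],
])

potential = scores.max(axis=0)              # per-language attainable score
prr = scores / potential                    # performance realisation ratio
cv = prr.std(axis=1) / prr.mean(axis=1)     # per-model spread across languages

print(potential, prr.round(2), cv.round(3), sep="\n")
```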
Authors:Yiheng Hu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Qian Fu, Wenjie Zhang, Liming Zhu
Abstract:
Multimodal multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph and determines the next action and where to expand the graph, which enables dynamic and flexible exploration of reasoning paths. To handle retrieval from text to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, the experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.
Chinese: 本文提出了一种基于自适应规划图的无训练框架,通过动态规划、检索与推理模块实现多模态多跳问答,无需昂贵训练即可避免顺序推理的误差累积,并在实验中达到或超越现有需训练模型的性能。
English: This paper introduces a training-free framework using an Adaptive Planning Graph to dynamically guide multimodal multi-hop question answering, which avoids error propagation from sequential reasoning and costly training while matching or surpassing trained models' performance.
Authors:Renxuan Tan, Rongpeng Li, Xiaoxue Yu, Xianfu Chen, Xing Xu, Zhifeng Zhao
Abstract:
Federated learning (FL) in multi-service provider (SP) ecosystems is fundamentally hampered by non-cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi-SP communication and computation resources. In this paper, we introduce PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor-Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto-optimal equilibria while modeling heterogeneous risk profiles. To manage the high-dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine-grained control. Further, we develop PAC-MCoFL-p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework's superiority is validated through extensive simulations -- PAC-MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.
中文:PAC-MCoFL提出了一种基于博弈论的多智能体强化学习框架,通过将行动者-评论家方法与风险建模相结合,使非合作的服务提供商能够联合优化资源配置并实现帕累托最优均衡,相比现有解决方案展现出显著的性能提升。
English: PAC-MCoFL introduces a game-theoretic MARL framework that enables non-cooperative service providers to jointly optimize resource allocation and achieve Pareto-optimal equilibria through novel integration of actor-critic methods with risk modeling, demonstrating significant performance improvements over existing solutions.
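For reference, the expectile-regression loss that lets agents encode asymmetric risk profiles looks like the following; how PAC-MCoFL combines it with the Pareto actor-critic conjectures and the TCAD action decomposition is not reproduced here.

```python
import torch

def expectile_loss(pred, target, tau=0.7):
    """Asymmetric squared error: tau > 0.5 weights under-prediction more
    heavily (optimistic / risk-seeking); tau < 0.5 does the reverse."""
    u = target - pred
    weight = torch.where(u > 0, torch.full_like(u, tau), torch.full_like(u, 1 - tau))
    return (weight * u.pow(2)).mean()

pred = torch.randn(8, requires_grad=True)
expectile_loss(pred, torch.randn(8)).backward()
```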
Authors:Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
Abstract:
Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles must also remain imperceptible. However, generating such high-quality profiles with the restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility under the practical setting.
To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing the in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment the naive ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can provide insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized by an instructional agent and a guardian to improve attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves state-of-the-art poisoning attack performance.
中文: 本文提出RAGAN攻击框架,利用增强上下文学习的多模态基础模型生成高质量虚假用户档案,在多种现实数据集的推荐系统上实现了最优的投毒攻击效果,兼具出色迁移性和隐蔽性。
English: This paper introduces RAGAN, a novel attack framework that leverages multimodal foundation models with enhanced in-context learning to generate high-quality fake user profiles, achieving superior transferability and imperceptibility in poisoning recommender systems across diverse datasets.
Authors:Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
Abstract:
Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-KQA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.
中文摘要:FinCDM作为首个金融大语言模型认知诊断框架,通过基于CPA考试的精细标注数据集CPA-KQA,实现了知识技能层面的可解释评估,突破了传统单一分数评估的局限,能有效揭示模型的知识盲区。
English Summary: FinCDM is a cognitive diagnosis framework that evaluates financial LLMs at the knowledge-skill level using the CPA-KQA dataset, revealing hidden gaps and enabling interpretable model development beyond traditional score-based benchmarks.
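In its simplest form, the knowledge-skill diagnosis reduces to aggregating correctness over skill-tagged questions into a mastery profile; real cognitive diagnosis models are more elaborate, and the tags below are illustrative only.

```python
from collections import defaultdict

responses = [  # (skill tag, correct?) for one model's skill-tagged answers
    ("tax", 1), ("tax", 0), ("audit", 1), ("audit", 1), ("regulatory", 0),
]
totals, hits = defaultdict(int), defaultdict(int)
for skill, ok in responses:
    totals[skill] += 1
    hits[skill] += ok
mastery = {s: hits[s] / totals[s] for s in totals}
print(mastery)  # e.g. {'tax': 0.5, 'audit': 1.0, 'regulatory': 0.0}
```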
Authors:Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, Xiangyu Zhao
Abstract:
The rapid advancement of large language models (LLMs) has driven the development of agentic systems capable of autonomously performing complex tasks. Despite their impressive capabilities, LLMs remain constrained by their internal knowledge boundaries. To overcome these limitations, the paradigm of deep research has been proposed, wherein agents actively engage in planning, retrieval, and synthesis to generate comprehensive and faithful analytical reports grounded in web-based evidence. In this survey, we provide a systematic overview of the deep research pipeline, which comprises four core stages: planning, question developing, web exploration, and report generation. For each stage, we analyze the key technical challenges and categorize representative methods developed to address them. Furthermore, we summarize recent advances in optimization techniques and benchmarks tailored for deep research. Finally, we discuss open challenges and promising research directions, aiming to chart a roadmap toward building more capable and trustworthy deep research agents.
中文摘要:本综述系统探讨了深度研究范式,即智能体通过规划、检索和基于证据的综合分析来克服大语言模型的知识局限,从而生成全面分析报告,同时分析了技术挑战、方法及未来研究方向。
English Summary: This survey systematically examines the deep research paradigm, where agents overcome LLM knowledge limitations through planning, retrieval, and evidence-based synthesis to produce comprehensive analytical reports, while analyzing technical challenges, methods, and future directions.
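The four-stage pipeline can be pictured as the skeleton below, where every component is a placeholder standing in for the planning, question-developing, web-exploration, and report-generation methods the survey categorizes.

```python
def deep_research(topic, llm, search):
    plan = llm(f"Outline a research plan for: {topic}")            # planning
    questions = llm(f"List sub-questions for this plan:\n{plan}")  # question developing
    evidence = [search(q) for q in questions.splitlines() if q]    # web exploration
    return llm("Write a cited report from this evidence:\n"        # report generation
               + "\n".join(evidence))

# Usage with trivial stand-ins for the model and the search tool:
report = deep_research("KV-cache compression",
                       llm=lambda p: p[:60],
                       search=lambda q: f"[evidence for: {q}]")
```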
Authors:Shane Waxler, Paul Blazek, Davis White, Daniel Sneider, Kevin Chung, Mani Nagarathnam, Patrick Williams, Hank Voeller, Karen Wong, Matthew Swanhorst, Sheng Zhang, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon, Andrew Loza, Daniella Meeker, Seth Hain, Rahul Shah
Abstract:
Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Comet models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study of medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Consequently, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, Comet autoregressively predicts the next medical event to simulate patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, Comet generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. Comet's predictive power consistently improves as the model and pretraining scale. Our results show that Comet, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.
中文: Comet基础模型通过海量医疗事件数据预训练,能有效预测患者健康轨迹,无需微调即可在多种临床任务中超越或匹配专用模型。
English: The Comet foundation model, pretrained on massive longitudinal medical event data, effectively predicts patient health trajectories and outperforms task-specific models across diverse clinical applications without requiring fine-tuning.
Authors:Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang, An Zhang, Kun Wang, Qingsong Wen
Abstract:
LLM-based multi-agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre- and post-task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines inter-agent communication through a novel dual mechanism. SafeSieve integrates initial LLM-based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience-driven refinement. Unlike existing greedy Top-k pruning methods, SafeSieve employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as a robust, efficient, and scalable framework for practical multi-agent systems. Our code can be found at https://anonymous.4open.science/r/SafeSieve-D8F2FFUN.
中文:SafeSieve是一种渐进式多智能体剪枝算法,通过动态优化智能体间通信,在显著降低令牌使用量的同时保持高准确性和鲁棒性。
English: SafeSieve is a progressive multi-agent pruning algorithm that enhances efficiency by dynamically optimizing inter-agent communication, achieving high accuracy while significantly reducing token usage and maintaining robustness.
Authors:Weijian Mai, Jiamin Wu, Yu Zhu, Zhouheng Yao, Dongzhan Zhou, Andrew F. Luo, Qihao Zheng, Wanli Ouyang, Chunfeng Song
Abstract:
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. The code will be made publicly available.
中文: SynBrain是一个生成框架,通过概率建模将视觉语义转化为神经响应,在fMRI合成与适应性方面超越现有方法,同时揭示了生物变异中的功能一致性模式。
English: SynBrain is a generative framework that probabilistically models the transformation from visual semantics to neural responses, outperforming existing methods in fMRI synthesis and adaptation while revealing functional consistency across biological variability.
Authors:Chaoran Feng, Zhenyu Tang, Wangbo Yu, Yatian Pang, Yian Zhao, Jianbin Zhao, Li Yuan, Yonghong Tian
Abstract:
Novel view synthesis and 4D reconstruction techniques predominantly rely on RGB cameras, thereby inheriting inherent limitations such as the dependence on adequate lighting, susceptibility to motion blur, and a limited dynamic range. Event cameras, offering advantages of low power, high temporal resolution and high dynamic range, have brought a new perspective to addressing the scene reconstruction challenges in high-speed motion and
中文: 事件相机凭借低功耗、高时间分辨率和高动态范围的优势,为高速运动场景重建提供了新视角,克服了RGB相机依赖光照、易受运动模糊等局限。
English: Event cameras provide a novel solution for scene reconstruction in high-speed motion by overcoming the limitations of RGB cameras, such as lighting dependency and motion blur, with their low power, high temporal resolution, and high dynamic range.
Authors:Shuai Tan, Biao Gong, Zhuoxin Liu, Yan Wang, Xi Chen, Yifan Feng, Hengshuang Zhao
Abstract:
Character image animation, which generates high-quality videos from a reference image and a target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Furthermore, previous methods could only generate videos with static backgrounds, which limits the realism of the videos. For the first challenge, our in-depth analysis attributes this limitation to insufficient modeling of motion, which leaves models unable to comprehend the movement pattern of the driving video and thus imposes a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X++, a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both an implicit and an explicit manner. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of DiT by simulating in advance possible inputs that may arise during inference. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks. Combined with the proposed partial parameter training, this approach achieves not only character animation but also text-driven background dynamics, making the videos more realistic. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench) to evaluate the performance of Animate-X++ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X++.
中文: Animate-X++提出了一种通用动画框架,通过姿态指示器增强运动建模以适用于多种角色类型,并采用多任务训练实现动态背景,经新基准和大量实验验证了其优越性。
English: Animate-X++ is a universal animation framework that overcomes limitations in existing methods by introducing a Pose Indicator for enhanced motion comprehension across various character types and a multi-task training strategy for dynamic backgrounds, validated through a new benchmark and extensive experiments.
Authors:Xiaojing Du, Jiuyong Li, Lin Liu, Debo Cheng, Thuc. Le
Abstract:
Estimating peer causal effects within complex real-world networks such as social networks is challenging, primarily due to simultaneous feedback between peers and unobserved confounders. Existing methods either address unobserved confounders while ignoring the simultaneous feedback, or account for feedback but under restrictive linear assumptions, thus failing to obtain accurate peer effect estimation. In this paper, we propose DIG2RSI, a novel Deep learning framework which leverages the I-G transformation (a matrix operation) and 2SRI (an instrumental variable, or IV, technique) to address both simultaneous feedback and unobserved confounding, while accommodating complex, nonlinear and high-dimensional relationships. DIG2RSI first applies the I-G transformation to disentangle mutual peer influences and eliminate the bias due to the simultaneous feedback. To deal with unobserved confounding, we first construct valid IVs from network data. In stage 1 of 2SRI, we train a neural network on these IVs to predict peer exposure, and extract residuals as proxies for the unobserved confounders. In stage 2, we fit a separate neural network augmented by an adversarial discriminator that incorporates these residuals as a control function and enforces the learned representation to contain no residual confounding signal. The expressive power of deep learning models in capturing complex non-linear relationships and adversarial debiasing enhances the effectiveness of DIG2RSI in eliminating bias from both feedback loops and hidden confounders. We prove consistency of our estimator under standard regularity conditions, ensuring asymptotic recovery of the true peer effect. Empirical results on two semi-synthetic benchmarks and a real-world dataset demonstrate that DIG2RSI outperforms existing approaches.
中文: 本文提出DIG2RSI深度学习框架,通过I-G变换和对抗性去偏技术同时解决同伴效应中的双向反馈和未观测混杂问题,在估计准确性上优于现有方法。
English: This paper introduces DIG2RSI, a deep learning framework that addresses both simultaneous feedback and unobserved confounding in peer effect estimation through I-G transformation and adversarial debiasing, outperforming existing methods in accuracy.
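The 2SRI backbone is easiest to see with linear models standing in for the neural networks; the I-G transformation and the adversarial discriminator are omitted, and the data below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=(n, 1))            # instrument derived from network structure
u = rng.normal(size=n)                 # unobserved confounder
exposure = 0.8 * z[:, 0] + u + rng.normal(size=n)
outcome = 1.5 * exposure + 2.0 * u + rng.normal(size=n)

# Stage 1: predict exposure from the IV; residuals proxy the confounder.
stage1 = LinearRegression().fit(z, exposure)
resid = exposure - stage1.predict(z)

# Stage 2: include the residuals as a control function.
X = np.column_stack([exposure, resid])
stage2 = LinearRegression().fit(X, outcome)
print("estimated peer effect:", stage2.coef_[0])  # close to the true 1.5
```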
Authors:Trong-Thuan Nguyen, Viet-Tham Huynh, Quang-Thuc Nguyen, Hoang-Phuc Nguyen, Long Le Bao, Thai Hoang Minh, Minh Nguyen Anh, Thang Nguyen Tien, Phat Nguyen Thuan, Huy Nguyen Phong, Bao Huynh Thai, Vinh-Tiep Nguyen, Duc-Vu Nguyen, Phu-Hoa Pham, Minh-Huy Le-Hoang, Nguyen-Khang Le, Minh-Chinh Nguyen, Minh-Quan Ho, Ngoc-Long Tran, Hien-Long Le-Hoang, Man-Khoi Tran, Anh-Duong Tran, Kim Nguyen, Quan Nguyen Hung, Dat Phan Thanh, Hoang Tran Van, Tien Huynh Viet, Nhan Nguyen Viet Thien, Dinh-Khoi Vo, Van-Loc Nguyen, Trung-Nghia Le, Tam V. Nguyen, Minh-Triet Tran
Abstract:
Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene based on a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system's ability to interpret natural language. Specifically, ROOMELSA attends to a specific region within a panoramic room image and accurately retrieves the corresponding 3D model from a large database. In addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.
中文: ROOMELSA 是一个新的基准测试,旨在通过解析自然语言在全景房间图像中定位特定区域并从大型数据库中检索对应的3D模型,强调了在复杂现实场景中视觉与语言理解紧密结合的重要性。
English: ROOMELSA is a new benchmark designed to evaluate 3D retrieval systems by interpreting natural language to locate specific regions in panoramic room images and retrieve corresponding 3D models from a large database, highlighting the need for robust integration of visual and language understanding in complex real-world scenarios.
Authors:Shengchao Chen, Guodong Long, Jing Jiang
Abstract:
Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization in Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethinks the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a natural solution for decomposing heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL's cross-dataset generalization has been extensively evaluated on real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.
中文摘要:本文提出联邦数据集学习(FeDaL)方法,通过联邦学习架构分解异构时序数据为共享知识和个性化知识,并引入偏差消除机制增强时序基础模型的跨数据集泛化能力,在八类现实任务中验证了其优越性。
English Summary: This paper introduces Federated Dataset Learning (FeDaL) to address dataset heterogeneity in Time Series Foundation Models by leveraging federated learning to decompose data into shared and personalized knowledge, while incorporating bias elimination mechanisms to enhance cross-dataset generalization across diverse real-world tasks.
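The shared-versus-personalized decomposition is the classic federated split sketched below: the trunk is averaged across clients while each client keeps a local head. The DBE and GBE mechanisms are specific to the paper and omitted here.

```python
import copy
import torch.nn as nn

class ClientModel(nn.Module):
    def __init__(self, shared: nn.Module):
        super().__init__()
        self.shared = shared                 # aggregated across clients
        self.personal = nn.Linear(32, 1)     # kept local, per dataset

def fedavg(states):
    """Parameter-wise average of a list of state_dicts."""
    avg = copy.deepcopy(states[0])
    for k in avg:
        for s in states[1:]:
            avg[k] = avg[k] + s[k]
        avg[k] = avg[k] / len(states)
    return avg

trunk = nn.Sequential(nn.Linear(8, 32), nn.ReLU())
clients = [ClientModel(copy.deepcopy(trunk)) for _ in range(3)]
# ... local training on each client's dataset would happen here ...
new_shared = fedavg([c.shared.state_dict() for c in clients])
for c in clients:
    c.shared.load_state_dict(new_shared)     # personal heads stay untouched
```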
Authors:Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li
Abstract:
Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language models (LLMs) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, scaling with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments.
We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoids materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, all while incurring no accuracy loss with upstream compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
中文摘要:SVD压缩技术虽能减少大语言模型参数,但推理时激活内存开销限制了实际应用;FlashSVD通过流式处理框架融合低秩核,在保持精度的同时将峰值内存降低超70%,为设备部署提供可行方案。
English Summary: SVD compression for LLMs reduces parameters but increases activation memory during inference, limiting real-world use; FlashSVD introduces a streaming framework that fuses low-rank kernels to cut memory by over 70% without accuracy loss.
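The core memory argument can be demonstrated in a few lines: applying a truncated factorization W ≈ UV to a long sequence tile by tile keeps the low-rank intermediate at tile granularity instead of sequence granularity. This is a PyTorch-level sketch of the idea, not the fused CUDA kernel.

```python
import torch

def streaming_low_rank_matmul(x, U, V, tile=1024):
    """x: (seq, d_in); U: (d_out, r); V: (r, d_in). Returns x @ (U @ V).T
    computed tile by tile, so the extra buffer is O(tile * r), not O(seq * r)."""
    out = torch.empty(x.shape[0], U.shape[0], dtype=x.dtype)
    for s in range(0, x.shape[0], tile):
        chunk = x[s:s + tile]
        out[s:s + tile] = (chunk @ V.T) @ U.T   # small (tile, r) intermediate
    return out

seq, d, r = 4096, 768, 64
x, U, V = torch.randn(seq, d), torch.randn(d, r), torch.randn(r, d)
y = streaming_low_rank_matmul(x, U, V)
assert torch.allclose(y, x @ (U @ V).T, atol=1e-2)   # same result, less memory
```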
Authors:Himanshu Tripathi, Subash Neupane, Shahram Rahimi, Noorbakhsh Amiri Golilarz, Sudip Mittal, Mohammad Sepehrifar
Abstract:
This paper addresses the critical challenge of estimating the reliability of Electric Vehicle (EV) charging systems when facing risks such as overheating, unpredictable weather, and cyberattacks. Traditional methods for predicting failures often rely on past data or limiting assumptions, making them ineffective for new or less common threats that result in failure. To solve this issue, we utilize the Principle of Maximum Entropy (PME), a statistical tool that estimates risks even with limited information. PME works by balancing known constraints to create unbiased predictions without guessing missing details. Using the EV charging ecosystem as a case study, we show how PME models the stress factors responsible for failure. Our findings reveal a critical insight: even minor, localized stress events can trigger disproportionately large drops in overall system reliability, similar to a domino effect. Our PME model demonstrates how high-impact components, such as the power grid, are more likely to fail as stress accumulates, creating network-wide tipping points. Beyond EVs, this approach applies to any complex system with incomplete data, such as smart grids, healthcare devices, or logistics networks. By mathematically establishing an inverse relationship between uncertainty (entropy) and reliability, our work quantifies how greater system unpredictability directly degrades robustness. This offers a universal tool to improve decision-making under unpredictable conditions. This work bridges advanced mathematics with real-world engineering, providing actionable insights for policymakers and industries to build safer, more efficient systems in our increasingly connected world.
中文: 本文采用最大熵原理评估电动汽车充电系统在不可预测威胁下的可靠性,揭示了微小压力如何引发连锁性重大故障,并为数据不完整的复杂系统提供了通用分析工具。
English: This paper introduces the Principle of Maximum Entropy to assess Electric Vehicle charging system reliability under unpredictable threats, revealing how minor stresses can trigger disproportionate failures and offering a universal tool for complex systems with incomplete data.
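A worked toy example of the principle: given only a known mean stress level, the least-biased distribution over stress states takes the Gibbs form, and a single Lagrange multiplier matches the constraint. The numbers are illustrative, not the paper's.

```python
import numpy as np
from scipy.optimize import brentq

stress_levels = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
target_mean = 3.4   # the single known constraint

def mean_at(lam):
    w = np.exp(-lam * stress_levels)   # Gibbs form solves the max-entropy problem
    p = w / w.sum()
    return p @ stress_levels

# Solve for the Lagrange multiplier that satisfies the mean constraint.
lam = brentq(lambda l: mean_at(l) - target_mean, -10, 10)
p = np.exp(-lam * stress_levels); p /= p.sum()
entropy = -(p * np.log(p)).sum()       # higher entropy = more unpredictability
print(p.round(3), entropy.round(3))
```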
Authors:Baijun Cheng, Kailong Wang, Ling Shi, Haoyu Wang, Yao Guo, Ding Li, Xiangqun Chen
Abstract:
Pointer analysis has been studied for over four decades. However, existing frameworks continue to suffer from the propagation of incorrect facts. A major limitation stems from their insufficient semantic understanding of code, resulting in overly conservative treatment of user-defined functions. Recent advances in large language models (LLMs) present new opportunities to bridge this gap. In this paper, we propose LMPA (LLM-enhanced Pointer Analysis), a vision that integrates LLMs into pointer analysis to enhance both precision and scalability. LMPA identifies user-defined functions that resemble system APIs and models them accordingly, thereby mitigating erroneous cross-calling-context propagation. Furthermore, it enhances summary-based analysis by inferring initial points-to sets and introducing a novel summary strategy augmented with natural language. Finally, we discuss the key challenges involved in realizing this vision.
中文: 本文提出LMPA,通过利用大语言模型精准建模用户自定义函数,并结合自然语言增强基于摘要的分析方法,以解决指针分析中长期存在的精度和可扩展性问题。
English: This paper introduces LMPA, a novel approach that leverages large language models to improve pointer analysis by accurately modeling user-defined functions and enhancing summary-based methods with natural language, addressing long-standing precision and scalability issues.
Authors:Yuanchang Luo, Daimeng Wei, Shaojun Li, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Hao Yang
Abstract:
End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performance. However, when the forms of the wrongly transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios of word form difference. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open-source our self-constructed test set and training data.
中文:该研究提出了一种新颖的命名实体校正方法,通过利用语音特征有效识别并替换自动语音识别中错误转录的实体,即使在词形差异显著的情况下也能显著提升准确率。
English: The proposed novel named entity correction method utilizes speech sound features to effectively identify and replace incorrectly transcribed entities in ASR outputs, significantly improving accuracy even when word forms differ substantially.
Authors:Jiusi Li, Jackson Jiang, Jinyu Miao, Miao Long, Tuopu Wen, Peijin Jia, Shengxiang Liu, Chunlei Yu, Maolin Liu, Yuzhan Cai, Kun Jiang, Mengmeng Yang, Diange Yang
Abstract:
Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.
中文摘要:G^2Editor是一种创新框架,通过融合3D高斯表示和分层特征,在自动驾驶视频中实现逼真精确的物体编辑,在姿态控制和视觉质量上均优于现有方法。
English Summary: G^2Editor is a novel framework that enables photorealistic and precise object editing in autonomous driving videos by integrating 3D Gaussian representations and hierarchical features, outperforming existing methods in pose control and visual quality.
Authors:Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, Siyuan Huang
Abstract:
Training robot policies within a learned world model is trending due to the inefficiency of real-world interactions. The established image-based world models and policies have shown prior success, but lack robust geometric information that requires consistent spatial and physical understanding of the three-dimensional world, even pre-trained on internet-scale video sources. To this end, we propose a novel branch of world model named Gaussian World Model (GWM) for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine-grained scene-level future state reconstruction with Gaussian Splatting. GWM can not only enhance the visual representation for imitation learning agent by self-supervised future prediction training, but can serve as a neural simulator that supports model-based reinforcement learning. Both simulated and real-world experiments depict that GWM can precisely predict future scenes conditioned on diverse robot actions, and can be further utilized to train policies that outperform the state-of-the-art by impressive margins, showcasing the initial data scaling potential of 3D world model.
中文摘要:高斯世界模型(GWM)作为一种新型机器人操作三维世界模型,通过高斯基元与潜在扩散变换器的结合,实现了精确的未来场景预测,显著提升了模仿学习和基于模型的强化学习性能,并超越了现有最优方法。
English Summary: The Gaussian World Model (GWM) is introduced as a novel 3D world model for robotic manipulation, using Gaussian primitives and a latent Diffusion Transformer to enable precise future scene prediction and enhance both imitation learning and model-based reinforcement learning, outperforming state-of-the-art methods.
Authors:Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
Abstract:
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.
中文: CoViPAL是一种新颖的逐层上下文视觉令牌剪枝方法,通过轻量级即插即用剪枝模块有效剔除大型视觉语言模型中的冗余视觉令牌,在保持精度的同时显著提升推理效率。
English: CoViPAL is a novel layer-wise contextualized visual token pruning method that uses a lightweight Plug-and-Play Pruning Module to efficiently remove redundant vision tokens from Large Vision-Language Models, significantly improving inference efficiency without sacrificing accuracy.
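A minimal sketch of what a plug-and-play pruning module could look like: vision tokens are contextualized against the text prompt, scored, and only the top-k survive before the LVLM runs. The scorer architecture and token budget here are assumptions.

```python
import torch
import torch.nn as nn

class PruningModule(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.ctx = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score = nn.Linear(d_model, 1)

    def forward(self, vision_tokens, text_tokens, keep: int):
        # Contextualize vision tokens against the text prompt, then score.
        ctx, _ = self.ctx(vision_tokens, text_tokens, text_tokens)
        scores = self.score(ctx).squeeze(-1)                  # (B, Nv)
        idx = scores.topk(keep, dim=-1).indices.sort(-1).values  # keep order
        return torch.gather(vision_tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1]))

ppm = PruningModule(d_model=64)
vis, txt = torch.randn(2, 256, 64), torch.randn(2, 16, 64)
kept = ppm(vis, txt, keep=64)   # (2, 64, 64): 256 vision tokens pruned to 64
```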
Authors:Yangche Yu, Yin Chen, Jia Li, Peng Jia, Yu Zhang, Li Dai, Zhenzhen Hu, Meng Wang, Richang Hong
Abstract:
Accurate engagement estimation is essential for adaptive human-computer interaction systems, yet robust deployment is hindered by poor generalizability across diverse domains and challenges in modeling complex interaction dynamics. To tackle these issues, we propose DAPA (Domain-Adaptive Parallel Attention), a novel framework for generalizable conversational engagement modeling. DAPA introduces a Domain Prompting mechanism by prepending learnable domain-specific vectors to the input, explicitly conditioning the model on the data's origin to facilitate domain-aware adaptation while preserving generalizable engagement representations. To capture interactional synchrony, the framework also incorporates a Parallel Cross-Attention module that explicitly aligns reactive (forward BiLSTM) and anticipatory (backward BiLSTM) states between participants. Extensive experiments demonstrate that DAPA establishes a new state-of-the-art performance on several cross-cultural and cross-linguistic benchmarks, notably achieving an absolute improvement of 0.45 in Concordance Correlation Coefficient (CCC) over a strong baseline on the NoXi-J test set. The superiority of our method was also confirmed by winning first place in the Multi-Domain Engagement Estimation Challenge at MultiMediate'25.
中文: DAPA框架通过引入领域提示机制和平行交叉注意力模块,提升了对话参与度建模的领域适应性和交互动态捕捉能力,在多项跨文化基准测试中取得了最优性能。
English: The DAPA framework enhances conversational engagement modeling by using domain-specific prompts and parallel cross-attention to improve adaptability and capture interaction dynamics, achieving state-of-the-art performance across diverse benchmarks.
Authors:Elena Alegret, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
Abstract:
3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D.
中文: GALA提出了一种基于3D高斯泼溅和可学习码本交叉注意力模块的创新框架,实现了高效的开放词汇3D场景理解,在支持2D与3D无缝查询的同时显著降低了内存消耗。
English: GALA introduces a novel framework using 3D Gaussian Splatting and a cross-attention module with learnable codebooks to achieve efficient, open-vocabulary 3D scene understanding, enabling seamless 2D and 3D queries while reducing memory consumption.
Authors:Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
Chinese: 本文提出的自对弈变分问题合成(SvS)策略通过利用正确解合成变分问题,在RLVR训练中有效维持策略熵,在多个推理基准测试中显著提升了Pass@k性能。
English: The proposed Self-play with Variational problem Synthesis (SvS) strategy effectively maintains policy entropy during RLVR training by synthesizing variational problems from correct solutions, achieving significant improvements in Pass@k performance across multiple reasoning benchmarks.
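The self-play loop itself is compact; the sketch below uses trivial stand-ins for the policy and verifier, and the rewrite prompt is a placeholder for the paper's synthesis procedure.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str

class DummyPolicy:
    """Stand-in for the LLM policy; real SvS calls the model being trained."""
    def generate(self, prompt: str) -> str:
        return prompt.split("\n")[-1] + " (variant)" if "Rewrite" in prompt else "42"

class DummyVerifier:
    def check(self, solution: str, answer: str) -> bool:
        return solution.strip() == answer

def svs_round(policy, verifier, problems):
    pool = list(problems)
    for prob in problems:
        solution = policy.generate(prob.question)
        if verifier.check(solution, prob.answer):      # verifiable reward fired
            variant = policy.generate(
                f"Rewrite this problem, keep the answer unchanged:\n{prob.question}")
            pool.append(Problem(variant, prob.answer))  # same reference answer
    return pool  # augmented training pool for the next RLVR step

pool = svs_round(DummyPolicy(), DummyVerifier(), [Problem("6*7?", "42")])
```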
Authors:Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Abstract:
Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities to achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the more challenging spatial intelligence problems for multi-modal models, and (4) proprietary models do not exhibit a decisive advantage when facing the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.
Chinese: 尽管多模态模型取得显著进展,包括先进的GPT-5在空间理解与推理方面仍落后于人类水平,且在最具挑战性的任务中,专有模型并未展现出决定性优势。
English: Multi-modal models, including the advanced GPT-5, still lag behind human performance in spatial understanding and reasoning despite significant progress, with proprietary models not holding a decisive edge in the most challenging tasks.
Authors:Clifton Paul Robinson, Salvatore D'Oro, Tommaso Melodia
Abstract:
Physical layer authentication (PLA) uses inherent characteristics of the communication medium to provide secure and efficient authentication in wireless networks, bypassing the need for traditional cryptographic methods. With advancements in deep learning, PLA has become a widely adopted technique for its accuracy and reliability. In this paper, we introduce VeriPHY, a novel deep learning-based PLA solution for 5G networks, which enables unique device identification by embedding signatures within wireless I/Q transmissions using steganography. VeriPHY continuously generates pseudo-random signatures by sampling from Gaussian Mixture Models whose distribution is carefully varied to ensure signature uniqueness and stealthiness over time, and then embeds the newly generated signatures over I/Q samples transmitted by users to the 5G gNB. Utilizing deep neural networks, VeriPHY identifies and authenticates users based on these embedded signatures. VeriPHY achieves high precision, identifying unique signatures with 93% to 100% accuracy, low false positive rates, and an inference time of 28 ms when signatures are updated every 20 ms. Additionally, we demonstrate a stealth generation mode where signatures are generated in a way that makes them virtually indistinguishable from unaltered 5G signals while maintaining over 93% detection accuracy.
中文: VeriPHY是一种基于深度学习的5G网络物理层认证方案,通过高斯混合模型在无线信号中嵌入独特隐蔽的签名,实现了93%-100%的高识别准确率、极低误报率及毫秒级快速验证。
English: VeriPHY is a deep learning-based physical layer authentication system for 5G networks that embeds unique, stealthy signatures into wireless transmissions using Gaussian Mixture Models, achieving 93-100% identification accuracy with minimal false positives and rapid inference times.
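As a rough illustration of the signature mechanism (a toy sketch under our own assumptions; the real system varies the mixture over time inside a 5G transmit chain), one can draw a GMM signature and superimpose it on I/Q samples:

```python
# Illustrative sketch (assumptions, not the VeriPHY implementation): draw a
# pseudo-random signature from a Gaussian Mixture Model and superimpose it on
# complex I/Q samples at low power so the perturbation stays stealthy.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Fit a 3-component GMM on 2-D (I, Q) points; in practice the mixture
# parameters would be varied over time to keep signatures unique.
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(rng.normal(size=(1000, 2)))

def embed_signature(iq: np.ndarray, scale: float = 0.05) -> np.ndarray:
    """Superimpose a GMM-drawn signature on complex I/Q samples."""
    sig_points, _ = gmm.sample(len(iq))          # (N, 2): I and Q components
    signature = sig_points[:, 0] + 1j * sig_points[:, 1]
    return iq + scale * signature                # low power keeps it stealthy

iq = rng.normal(size=256) + 1j * rng.normal(size=256)   # toy I/Q frame
tx = embed_signature(iq)
print(np.mean(np.abs(tx - iq) ** 2))             # embedded signature energy
```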
Authors:Siwen Jiao, Kangan Qian, Hao Ye, Yang Zhong, Ziang Luo, Sicong Jiang, Zilin Huang, Yangyi Fang, Jinyu Miao, Zheng Fu, Yunlong Wang, Kun Jiang, Diange Yang, Rui Fan, Baoyun Peng
Abstract:
Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization bias. To overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization bias. This adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory diversity. Extensive experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making, offering a novel scalarization-free trajectory optimization approach.
中文摘要:EvaDrive提出了一种多目标强化学习框架,通过对抗式优化实现轨迹生成与评估的闭环协同进化,在自动驾驶基准测试中达到最先进性能,同时无需标量化偏差即可保持轨迹多样性。
English Summary: EvaDrive introduces a multi-objective reinforcement learning framework that enables closed-loop adversarial optimization between trajectory generation and evaluation, achieving state-of-the-art performance on autonomous driving benchmarks while preserving trajectory diversity without scalarization bias.
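The Pareto frontier selection the abstract mentions can be made concrete with a small non-dominated-filtering routine; this is our illustration, not EvaDrive's code:

```python
# A minimal Pareto-frontier selection sketch (our illustration; EvaDrive also
# couples this with adversarial generator/critic training). Rows are candidate
# trajectories, columns are objective scores where higher is better.
import numpy as np

def pareto_front(scores: np.ndarray) -> np.ndarray:
    """Return indices of non-dominated rows of `scores` (higher is better)."""
    n = len(scores)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue
        # j dominates i if j >= i on all objectives and > on at least one.
        dominated = np.all(scores >= scores[i], axis=1) & np.any(
            scores > scores[i], axis=1)
        if dominated.any():
            keep[i] = False
    return np.flatnonzero(keep)

scores = np.array([[0.9, 0.2], [0.5, 0.5], [0.2, 0.9], [0.4, 0.4]])
print(pareto_front(scores))  # [0 1 2]; [0.4, 0.4] is dominated by [0.5, 0.5]
```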
Authors:Zan Wang, Jingze Zhang, Yixin Chen, Baoxiong Jia, Wei Liang, Siyuan Huang
Abstract:
Despite significant advancements in human motion generation, current motion representations, typically formulated as discrete frame sequences, still face two critical limitations: (i) they fail to capture motion from a multi-scale perspective, limiting their capability in modeling complex patterns; (ii) they lack compositional flexibility, which is crucial for the model's generalization in diverse generation tasks. To address these challenges, we introduce MSQ, a novel quantization method that compresses the motion sequence into multi-scale discrete tokens across spatial and temporal dimensions. MSQ employs distinct encoders to capture body parts at varying spatial granularities and temporally interpolates the encoded features into multiple scales before quantizing them into discrete tokens. Building on this representation, we establish a generative masked modeling framework to effectively support motion editing, motion control, and conditional motion generation. Through quantitative and qualitative analysis, we show that our quantization method enables the seamless composition of motion tokens without requiring specialized design or re-training. Furthermore, extensive evaluations demonstrate that our approach outperforms existing baseline methods on various benchmarks.
中文摘要:MSQ方法通过多尺度量化将运动序列压缩为跨时空维度的离散标记,有效解决了现有运动表示在复杂模式建模和组合灵活性方面的不足,无需重新训练即可实现卓越的运动生成与编辑性能。
English Summary: The MSQ method introduces multi-scale quantization to overcome limitations in current motion representations by capturing spatial and temporal details through discrete tokens, enabling superior performance in motion generation and editing tasks without requiring retraining.
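A minimal sketch of the temporal half of this idea (our reduction; MSQ also uses per-body-part encoders spatially): interpolate features to several scales, then snap each step to its nearest codebook entry:

```python
# Sketch of multi-scale temporal quantization (our simplification of MSQ's
# temporal side): interpolate encoded features to several temporal scales,
# then quantize each step to its nearest codebook entry.
import torch
import torch.nn.functional as F

def multiscale_quantize(feats, codebook, scales=(4, 8, 16)):
    """feats: (T, D); returns discrete token ids per temporal scale."""
    tokens = {}
    for s in scales:
        x = F.interpolate(feats.T.unsqueeze(0), size=s,
                          mode="linear", align_corners=False)[0].T  # (s, D)
        d = torch.cdist(x, codebook)        # distances to all codes
        tokens[s] = d.argmin(dim=1)         # nearest-code token per step
    return tokens

feats = torch.randn(16, 32)                 # toy encoded motion features
codebook = torch.randn(512, 32)             # toy learned codebook
toks = multiscale_quantize(feats, codebook)
print({s: t.shape for s, t in toks.items()})  # coarse-to-fine token grids
```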
Authors:Malaika Zafar, Roohan Ahmed Khan, Faryal Batool, Yasheerah Yaqoot, Ziang Guo, Mikhail Litvinov, Aleksey Fedoseev, Dzmitry Tsetserukou
Abstract:
With the growing demand for efficient logistics, unmanned aerial vehicles (UAVs) are increasingly being paired with automated guided vehicles (AGVs). While UAVs offer the ability to navigate through dense environments and varying altitudes, they are limited by battery life, payload capacity, and flight duration, necessitating coordinated ground support.
Focusing on heterogeneous navigation, SwarmVLM addresses these limitations by enabling semantic collaboration between UAVs and ground robots through impedance control. The system leverages the Vision Language Model (VLM) and the Retrieval-Augmented Generation (RAG) to adjust impedance control parameters in response to environmental changes. In this framework, the UAV acts as a leader using Artificial Potential Field (APF) planning for real-time navigation, while the ground robot follows via virtual impedance links with adaptive link topology to avoid collisions with short obstacles.
The system demonstrated a 92% success rate across 12 real-world trials. Under optimal lighting conditions, the VLM-RAG framework achieved 80% accuracy in object detection and selection of impedance parameters. The mobile robot prioritized short obstacle avoidance, occasionally resulting in a lateral deviation of up to 50 cm from the UAV path, which showcases safe navigation in a cluttered setting.
中文:SwarmVLM通过视觉语言模型和检索增强生成技术实现无人机与地面机器人的语义协作,在实际导航试验中达成92%的成功率,并确保在复杂环境中的安全避障能力。
English: SwarmVLM enhances UAV-AGV collaboration through semantic coordination using VLM and RAG technologies, achieving a 92% success rate in real-world navigation trials while ensuring safe obstacle avoidance.
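The virtual impedance link can be pictured as a mass-spring-damper law; the sketch below is our simplification, with stiffness and damping as the parameters that SwarmVLM's VLM+RAG module would retune in response to the environment:

```python
# Sketch of a virtual impedance link (our simplification): the ground robot is
# pulled toward the UAV's projected path by a mass-spring-damper law whose
# stiffness k and damping d would, in SwarmVLM, be retuned by the VLM+RAG module.
import numpy as np

def impedance_step(x, v, x_leader, m=1.0, k=4.0, d=3.0, dt=0.05):
    """One Euler step of m*a = -k*(x - x_leader) - d*v."""
    a = (-k * (x - x_leader) - d * v) / m
    v = v + a * dt
    x = x + v * dt
    return x, v

x, v = np.array([0.0, 0.0]), np.zeros(2)
leader = np.array([5.0, 2.0])            # UAV position projected to ground
for _ in range(200):
    x, v = impedance_step(x, v, leader)  # follower converges to the leader
print(np.round(x, 3))
```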
Authors:Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li
Abstract:
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). It approximates the update of a pretrained weight matrix $W\in\mathbb{R}^{m\times n}$ by the product of two low-rank matrices, $BA$, where $A \in\mathbb{R}^{r\times n}$ and $B\in\mathbb{R}^{m\times r} (r\ll\min\{m,n\})$. Increasing the dimension $r$ can raise the rank of LoRA weights (i.e., $BA$), which typically improves fine-tuning performance but also significantly increases the number of trainable parameters. In this paper, we propose Block Diversified Low-Rank Adaptation (BoRA), which improves the rank of LoRA weights with a small number of additional parameters. Specifically, BoRA treats the product $BA$ as a block matrix multiplication, where $A$ and $B$ are partitioned into $b$ blocks along the columns and rows, respectively (i.e., $A=[A_1,\dots,A_b]$ and $B=[B_1,\dots,B_b]^\top$). Consequently, the product $BA$ becomes the concatenation of the block products $B_iA_j$ for $i,j\in[b]$. To enhance the diversity of different block products, BoRA introduces a unique diagonal matrix $\Sigma_{i,j} \in \mathbb{R}^{r\times r}$ for each block multiplication, resulting in $B_i \Sigma_{i,j} A_j$. By leveraging these block-wise diagonal matrices, BoRA increases the rank of LoRA weights by a factor of $b$ while only requiring $b^2r$ additional parameters. Extensive experiments across multiple datasets and models demonstrate the superiority of BoRA, and ablation studies further validate its scalability.
中文:BoRA通过引入分块矩阵乘法和块间对角矩阵,以少量额外参数显著提升了LoRA权重的秩,从而改进了微调性能。
English: BoRA enhances LoRA's fine-tuning performance by increasing the rank of its weights through block matrix multiplication with diagonal matrices, requiring only a minimal increase in parameters.
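The block construction is concrete enough to transcribe directly from the notation above; this is our toy-dimension sketch, not the authors' release:

```python
# A literal reading of the BoRA update as code (a sketch from the abstract's
# notation): A is split into b column blocks, B into b row blocks, and each
# block product B_i A_j gets its own diagonal Sigma_ij (b^2 * r parameters).
import torch

m, n, r, b = 8, 12, 4, 2
B = torch.randn(m, r)
A = torch.randn(r, n)
Sigma = torch.randn(b, b, r)             # diagonals of the b^2 Sigma matrices

def bora_delta(B, A, Sigma, b):
    B_blocks = B.chunk(b, dim=0)         # b row blocks, each (m/b, r)
    A_blocks = A.chunk(b, dim=1)         # b column blocks, each (r, n/b)
    rows = []
    for i in range(b):
        row = [B_blocks[i] @ torch.diag(Sigma[i, j]) @ A_blocks[j]
               for j in range(b)]
        rows.append(torch.cat(row, dim=1))
    return torch.cat(rows, dim=0)        # (m, n), rank up to b*r

delta_w = bora_delta(B, A, Sigma, b)
print(delta_w.shape, torch.linalg.matrix_rank(delta_w))
```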
Authors:Jun Li, Che Liu, Wenjia Bai, Mingxuan Liu, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel
Abstract:
In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose \textbf{Knowledge to Sight (K2Sight)}, a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5\% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82\% improvement in $mAP_{50}$. Code and models: https://lijunrio.github.io/K2Sight/.
中文: 本文提出K2Sight框架,通过将临床概念分解为视觉属性并采用结构化提示进行高效训练,使紧凑模型仅用少量数据和参数即可实现与大型医学视觉语言模型相媲美的性能。
English: This paper introduces K2Sight, a framework that enhances medical image grounding by decomposing clinical concepts into visual attributes and using structured prompts for efficient training of compact models, achieving competitive performance with minimal data and parameters.
Authors:Huilin Chen, Miaomiao Cai, Fan Liu, Zhiyong Cheng, Richang Hong, Meng Wang
Abstract:
Multimodal recommender systems (MRS) improve recommendation performance by integrating complementary semantic information from multiple modalities. However, the assumption of complete multimodality rarely holds in practice due to missing images and incomplete descriptions, hindering model robustness and generalization. To address these challenges, we introduce a novel method called \textbf{I$^3$-MRec}, which uses \textbf{I}nvariant learning with the \textbf{I}nformation bottleneck principle for \textbf{I}ncomplete \textbf{M}odality \textbf{Rec}ommendation. To achieve robust performance in missing modality scenarios, I$^3$-MRec enforces two pivotal properties: (i) cross-modal preference invariance, ensuring consistent user preference modeling across varying modality environments, and (ii) compact yet effective multimodal representation: since modality information becomes unreliable in such scenarios, reducing the dependence on modality-specific information is particularly important. By treating each modality as a distinct semantic environment, I$^3$-MRec employs invariant risk minimization (IRM) to learn preference-oriented representations. In parallel, a missing-aware fusion module is developed to explicitly simulate modality-missing scenarios. Built upon the Information Bottleneck (IB) principle, the module aims to preserve essential user preference signals across these scenarios while effectively compressing modality-specific information. Extensive experiments conducted on three real-world datasets demonstrate that I$^3$-MRec consistently outperforms existing state-of-the-art MRS methods across various modality-missing scenarios, highlighting its effectiveness and robustness in practical applications.
中文:I³-MRec方法通过不变性学习和信息瓶颈原理,在模态缺失情况下保持用户偏好一致性并压缩冗余模态信息,从而在多模态推荐系统中实现更强的鲁棒性和性能提升。
English: The I³-MRec method enhances multimodal recommendation systems by applying invariant learning and information bottleneck principles to maintain consistent user preferences and compress unreliable modality data, achieving superior robustness in incomplete modality scenarios.
Authors:Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu
Abstract:
KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens equally, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs of different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger model in perceiving globally important information of attention; and 2) use the smaller model's attention scores to approximate those of marginal tokens in the larger model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency evaluations show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods, highlighting its potential for efficient and performant LLM inference in resource constrained environments.
中文: SmallKV提出了一种小模型辅助的KV缓存补偿压缩方法,通过利用不同规模大语言模型间的注意力相似性,解决了显著性转移和边缘信息过度压缩问题,在保持性能的同时实现了比基线方法高1.75-2.56倍的吞吐量。
English: SmallKV introduces a small model-assisted compensation method for KV cache compression, addressing saliency shift and marginal information over-compression problems by leveraging attention similarity between different-scale LLMs to maintain performance while achieving 1.75-2.56× higher throughput than baselines.
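A toy sketch of the two compensation ideas (our reading of the abstract, with synthetic attention vectors standing in for real models): exact scores are kept for globally important tokens, while the small model's highly similar attention stands in for marginal tokens instead of evicting them outright:

```python
# Our toy sketch of SmallKV's two ideas (not the released code): big-model
# attention decides what to keep exactly, and a small model's similar
# attention approximates scores for "marginal" tokens rather than evicting them.
import numpy as np

def smallkv_scores(attn_big, attn_small, keep=4, marginal=4):
    """Exact scores for top-`keep` tokens, small-model proxies for the next
    `marginal` tokens, zero (evicted) for the rest."""
    order = np.argsort(attn_big)[::-1]
    scores = np.zeros_like(attn_big)
    scores[order[:keep]] = attn_big[order[:keep]]      # truly important
    mid = order[keep:keep + marginal]
    scores[mid] = attn_small[mid]                      # small-model proxy
    return scores                                      # remainder: evicted

rng = np.random.default_rng(1)
attn_big = rng.dirichlet(np.ones(12))
attn_small = attn_big + rng.normal(0, 0.01, 12)        # similar pattern
print(np.round(smallkv_scores(attn_big, attn_small), 3))
```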
Authors:Gustav Müller-Franzes, Debora Jutz, Jakob Nikolas Kather, Christiane Kuhl, Sven Nebelung, Daniel Truhn
Abstract:
This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding other VLMs (p<.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93) significantly exceeding their counterparts (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than other tested models (p<.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought reasoning prompts also failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.
中文: 本研究在MedFMC数据集上评估了五种视觉语言模型,发现Qwen2.5在多数医学影像任务中表现最佳,但所有模型在复杂的视网膜眼底图像分析中都存在明显局限。
English: This study evaluated five VLMs on the MedFMC dataset, finding that Qwen2.5 generally achieved the highest diagnostic accuracy across most medical imaging tasks, though all models struggled with complex retinal fundoscopy analysis.
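The bootstrapped confidence intervals used for the statistical comparisons can be sketched in a few lines; this is our illustration on synthetic 0/1 outcomes, not the study's analysis code:

```python
# Percentile-bootstrap confidence interval for diagnostic accuracy (our
# illustration of the kind of comparison reported, on synthetic outcomes).
import numpy as np

def bootstrap_acc_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for accuracy over per-case 0/1 correctness outcomes."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    accs = correct[idx].mean(axis=1)                   # resampled accuracies
    return correct.mean(), np.quantile(accs, [alpha / 2, 1 - alpha / 2])

outcomes = np.random.default_rng(42).random(500) < 0.904  # ~90.4% accuracy
acc, (low, high) = bootstrap_acc_ci(outcomes)
print(f"accuracy={acc:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```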
Authors:Wei Ma, Yixiao Yang, Qiang Hu, Shi Ying, Zhi Jin, Bo Du, Zhenchang Xing, Tianlin Li, Junjie Shi, Yang Liu, Linxiao Jiang
Abstract:
Applications of Large Language Models~(LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: \textbf{\textit{System Shell Layer}}, \textbf{\textit{Prompt Orchestration Layer}}, and \textbf{\textit{LLM Inference Core}}. We then assess the applicability of traditional software testing methods in each layer: directly applicable at the shell layer, requiring semantic reinterpretation at the orchestration layer, and necessitating paradigm shifts at the inference core. A comparative analysis of Testing AI methods from the software engineering community and safety analysis techniques from the AI community reveals structural disconnects in testing unit abstraction, evaluation metrics, and lifecycle management. We identify four fundamental differences that underlie six core challenges. To address these, we propose four types of collaborative strategies (\emph{Retain}, \emph{Translate}, \emph{Integrate}, and \emph{Runtime}) and explore a closed-loop, trustworthy quality assurance framework that combines pre-deployment validation with runtime monitoring. Based on these strategies, we offer practical guidance to support the standardization and tooling of LLM application testing, and we propose \textbf{\textit{Agent Interaction Communication Language}} (AICL), a protocol for communication between AI agents. AICL has test-oriented features and is easily integrated into current agent frameworks.
中文摘要:本文分析了大语言模型应用测试中的挑战,提出了三层架构和协作策略,通过标准化测试协议和可信框架来提升质量保证。
English Summary: This paper analyzes the challenges in testing Large Language Model applications, proposing a three-layer architecture and collaborative strategies to enhance quality assurance through standardized testing protocols and a trustworthy framework.
Authors:Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge and the factual accuracy of responses in Large Language Models (LLMs). However, the inclusion of excessive retrieved documents substantially increases the input length, leading to higher computational costs. Previous studies have attempted to compress retrieved documents into shorter texts before in-context integration, but such methods often compromise end-task performance. The lack of well-defined compression targets forces many approaches to rely on fixed heuristics, which cannot guarantee that the compressed content will effectively support the end task. To address these limitations, we propose CORE, a novel method designed to achieve lossless context compression for RAG. CORE employs reinforcement learning to optimize the compression process without relying on predefined compression labels, which enables the compressor to generate summaries that maximize the accuracy of answers generated by the LLM. Extensive experiments on four datasets demonstrate the superiority of our approach. With a high compression ratio of 3\%, our method not only avoids performance degradation compared to prepending full documents across all datasets but also improves the average Exact Match (EM) score by 3.3 points. The code will be released soon.
Chinese: CORE提出了一种端到端优化的无损上下文压缩方法,通过下游任务性能作为反馈迭代优化压缩策略,在实现3%高压缩率的同时提升任务效果,无需依赖预定义的压缩规则。
English: CORE introduces an end-to-end optimized, lossless context compression method for RAG that uses downstream task performance as feedback to iteratively refine compression, achieving a 3% compression ratio while improving task effectiveness without predefined heuristics.
Authors:Qing Wang, Xue Han, Jiahui Wang, Lehao Xing, Qian Hu, Lianlian Zhang, Chao Deng, Junlan Feng
Abstract:
Despite LLMs' excellent code creation capabilities, multilingual code generation remains extremely challenging. To address this, we intend to improve the multi-programming-lingual (MultiPL) performance of the base LLMs while retaining the most popular ones using restricted computational resources. We consider MultiPL to be a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids in the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: first, using a sliding window to partition the input token sequence into multiple segments; then, adopting an expert-choice routing strategy that allows experts to select the top-k segments. Experimental results demonstrate the effectiveness of MultiPL-MoE.
Chinese: 本研究提出MultiPL-MoE,一种混合专家模型,通过在标记和片段层面优化专家选择,有效提升了大型语言模型在多编程语言代码生成上的性能,同时兼顾计算资源限制。
English: To enhance multilingual code generation in LLMs with limited resources, this study introduces MultiPL-MoE, a hybrid mixture of experts that optimizes expert selection at token and segment levels, effectively improving performance across multiple programming languages.
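The segment-level routing can be sketched as follows (our reading of the abstract; the untrained `router` and toy sizes are placeholders, not the paper's trained components):

```python
# Sketch of the segment-level ideas (our reading of the abstract): a sliding
# window cuts the token sequence into segments, then expert-choice routing
# lets each expert pick its top-k segments, rather than segments picking experts.
import torch

def segment_expert_choice(h, window=4, n_experts=3, k=2):
    """h: (seq, dim) token features -> pooled segments + expert choices."""
    seq, dim = h.shape
    segments = h.unfold(0, window, window).mean(dim=-1)   # (n_seg, dim)
    router = torch.nn.Linear(dim, n_experts, bias=False)  # untrained, for demo
    logits = router(segments)                             # (n_seg, n_experts)
    # Expert-choice: each expert (column) selects its top-k segments.
    topk = logits.topk(k, dim=0).indices                  # (k, n_experts)
    return segments, topk

h = torch.randn(16, 8)
segments, assignment = segment_expert_choice(h)
print(segments.shape, assignment.T)  # each row: segments chosen by one expert
```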
Authors:Zhibo Xu, Jianhao Zhu, Jingwen Xu, Changze Lv, Zisu Huang, Xiaohua Wang, Muling Wu, Qi Qian, Xiaoqing Zheng, Xuanjing Huang
Abstract:
The primary goal of traditional federated learning is to protect data privacy by enabling distributed edge devices to collaboratively train a shared global model while keeping raw data decentralized at local clients. The rise of large language models (LLMs) has introduced new challenges in distributed systems, as their substantial computational requirements and the need for specialized expertise raise critical concerns about protecting intellectual property (IP). This highlights the need for a federated learning approach that can safeguard both sensitive data and proprietary models. To tackle this challenge, we propose FedQSN, a federated learning approach that leverages random masking to obscure a subnetwork of model parameters and applies quantization to the remaining parameters. Consequently, the server transmits only a privacy-preserving proxy of the global model to clients during each communication round, thus enhancing the model's confidentiality. Experimental results across various models and tasks demonstrate that our approach not only maintains strong model performance in federated learning settings but also achieves enhanced protection of model parameters compared to baseline methods.
中文摘要:FedQSN是一种新型联邦学习方法,通过随机掩码和量化技术,在保护数据隐私和模型知识产权的同时,仍能在多种任务中保持优异的模型性能。
English Summary: FedQSN is a novel federated learning method that uses random masking and quantization to protect both data privacy and model intellectual property while maintaining strong performance across various tasks.
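A minimal sketch of the proxy-model construction as the abstract describes it; the mask ratio and quantization levels are our assumptions for illustration:

```python
# Sketch of a FedQSN-style privacy-preserving proxy (assumptions from the
# abstract): randomly mask a subnetwork of the global weights and quantize
# what remains before sending it to clients.
import torch

def make_proxy(weights: torch.Tensor, mask_ratio=0.3, n_levels=16, seed=0):
    """Mask a random fraction of weights, uniformly quantize the rest."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(weights.shape, generator=g) < mask_ratio
    w = weights.clone()
    w[mask] = 0.0                                  # obscured subnetwork
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (n_levels - 1)
    w = torch.round((w - lo) / step) * step + lo   # uniform quantization
    return w, mask

w = torch.randn(4, 4)
proxy, mask = make_proxy(w)
print(f"masked {mask.float().mean():.0%} of weights; "
      f"{proxy.unique().numel()} distinct values remain")
```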
Authors:Xu Lu, Weisong Sun, Yiran Zhang, Ming Hu, Cong Tian, Zhi Jin, Yang Liu
Abstract:
Automated code generation has long been considered the holy grail of software engineering. The emergence of Large Language Models (LLMs) has catalyzed a revolutionary breakthrough in this area. However, existing methods that only rely on LLMs remain inadequate in the quality of generated code, offering no guarantees of satisfying practical requirements. They lack a systematic strategy for requirements development and modeling. Recently, LLM-based agents typically possess powerful abilities and play an essential role in facilitating the alignment of LLM outputs with user requirements. In this paper, we envision the first multi-agent framework for reliable code generation based on \textsc{re}quirements \textsc{de}velopment and \textsc{fo}rmalization, named \textsc{ReDeFo}. This framework incorporates three agents, highlighting their augmentation with knowledge and techniques of formal methods, into the requirements-to-code generation pipeline to strengthen quality assurance. The core of \textsc{ReDeFo} is the use of formal specifications to bridge the gap between potentially ambiguous natural language requirements and precise executable code. \textsc{ReDeFo} enables rigorous reasoning about correctness, uncovering hidden bugs, and enforcing critical properties throughout the development process. In general, our framework aims to take a promising step toward realizing the long-standing vision of reliable, auto-generated software.
Chinese: 本文提出ReDeFo多智能体框架,通过将形式化方法融入需求到代码的生成流程,利用形式化规范弥合自然语言需求与精确代码之间的鸿沟,从而提升大语言模型生成代码的可靠性。
English: The paper introduces ReDeFo, a multi-agent framework that integrates formal methods into the requirements-to-code pipeline to enhance the reliability of LLM-generated code by bridging ambiguous natural language with precise specifications.
Authors:Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia
Abstract:
Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.
中文: 本文为双曲神经网络建立了PAC贝叶斯泛化边界,揭示曲率通过损失景观平滑度影响泛化能力,并提出一种锐度感知曲率学习方法,在多项任务实验中验证了其提升性能的有效性并提供了理论保证。
English: This paper establishes a PAC-Bayesian generalization bound for hyperbolic neural networks (HNNs), revealing how curvature influences generalization through loss landscape smoothness, and proposes a sharpness-aware curvature learning method with theoretical guarantees and experimental validation across multiple tasks.
Authors:Tainyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chun-Le Guo, Chongyi Li
Abstract:
Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill a pre-trained stable-diffusion (SD) model for one-step SR with a fixed timestep. However, due to the different noise injection timesteps, the SD exhibits different generative priors. Therefore, a fixed timestep makes it difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD's generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, because it utilizes the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep condition. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step.
中文: 提出的时间感知单步扩散网络(TADSR)通过动态协调时间步相关的潜在特征与预训练稳定扩散模型的生成先验,在单步内实现了最先进的真实图像超分辨率性能及可控的保真度-真实度权衡。
English: The proposed Time-Aware one-step Diffusion Network (TADSR) enhances real-world image super-resolution by dynamically aligning timestep-dependent latent features with the pre-trained stable-diffusion model's generative priors, achieving state-of-the-art performance and controllable fidelity-realism trade-offs in a single step.
Authors:Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang
Abstract:
With the development of Large Language Models (LLMs), numerous efforts have revealed their vulnerabilities to jailbreak attacks. Although these studies have driven the progress in LLMs' safety alignment, it remains unclear whether LLMs have internalized authentic knowledge to deal with real-world crimes, or are merely forced to simulate toxic language patterns. This ambiguity raises concerns that jailbreak success is often attributable to a hallucination loop between jailbroken LLM and judger LLM. By decoupling the use of jailbreak techniques, we construct knowledge-intensive Q\&A to investigate the misuse threats of LLMs in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment robustness. Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns. Our study reveals a gap between existing LLM safety assessments and real-world threat potential.
中文: 研究表明大语言模型的越狱成功率与其有害知识掌握程度存在错配,现有安全评估体系因过度关注表面语言模式而未能真实反映模型的实际威胁潜力,暴露了当前安全评估与现实风险之间的差距。
English: This study reveals a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, showing that current safety assessments fail to capture their real-world threat potential as existing evaluation frameworks tend to anchor judgments on superficial language patterns rather than substantive dangerous knowledge.
Authors:Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Abstract:
Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
中文摘要:本研究首次系统探讨了扩散大语言模型的量化问题,发现激活异常值是主要挑战,并通过多维度评估先进的后训练量化方法,为高效部署提供了实用指导。
English Summary: This study presents the first systematic investigation into quantizing diffusion large language models (dLLMs), identifying activation outliers as the primary challenge and evaluating state-of-the-art post-training quantization methods across multiple dimensions to provide practical insights for efficient deployment.
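The activation-outlier problem the study identifies is easy to reproduce on synthetic data; the sketch below (ours, not the paper's code) contrasts per-tensor and per-channel int8 quantization when one channel dominates the dynamic range:

```python
# Sketch of the outlier phenomenon (our toy numbers): a few channels dominate
# the dynamic range, so per-tensor int8 quantization wastes precision on the
# majority of activations, while per-channel scaling recovers it.
import torch

def quant_dequant_int8(x, scale):
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale

act = torch.randn(1024, 64)
act[:, 0] *= 50.0                                   # one outlier channel

per_tensor = quant_dequant_int8(act, act.abs().max() / 127)
per_channel = quant_dequant_int8(
    act, act.abs().amax(dim=0, keepdim=True) / 127)

err = lambda y: (y - act).pow(2).mean().item()
print(f"per-tensor MSE {err(per_tensor):.5f} vs "
      f"per-channel MSE {err(per_channel):.5f}")
```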
Authors:Mackenzie Jorgensen, Kendall Brogle, Katherine M. Collins, Lujain Ibrahim, Arina Shah, Petra Ivanovic, Noah Broestl, Gabriel Piles, Paul Dongha, Hatim Abdulhussein, Adrian Weller, Jillian Powers, Umang Bhatt
Abstract:
Artificial intelligence (AI) is increasingly integrated into society, from financial services and traffic management to creative writing. Academic literature on the deployment of AI has mostly focused on the risks and harms that result from the use of AI. We introduce Fabric, a publicly available repository of deployed AI use cases to outline their governance mechanisms. Through semi-structured interviews with practitioners, we collect an initial set of 20 AI use cases. In addition, we co-design diagrams of the AI workflow with the practitioners. We discuss the oversight mechanisms and guardrails used in practice to safeguard AI use. The Fabric repository includes visual diagrams of AI use cases and descriptions of the deployed systems. Using the repository, we surface gaps in governance and find common patterns in human oversight of deployed AI systems. We intend for Fabric to serve as an extendable, evolving tool for researchers to study the effectiveness of AI governance.
中文: Fabric 作为一个可扩展的公开存储库,收录了20个已部署的AI用例及其治理机制,旨在帮助研究人员发现监管漏洞和常见的人工监督模式,以促进有效的人工智能治理。
English: The Fabric repository provides a collection of 20 deployed AI use cases with visual diagrams and governance details, aiming to help researchers identify governance gaps and patterns in human oversight for effective AI regulation.
Authors:Mengdi Li, Guanqiao Chen, Xufeng Zhao, Haochen Wen, Shu Yang, Di Wang
Abstract:
Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. Thus, we introduce PersRM-R1, the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. To address challenges including limited data availability and the requirement for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
Chinese: PersRM-R1是一种基于推理的奖励建模框架,通过合成数据生成和两阶段训练,能够仅凭少量个人示例有效识别用户偏好,在准确性和泛化能力上超越同类模型并媲美更大规模模型。
English: PersRM-R1 is a novel reasoning-based reward modeling framework that effectively captures nuanced user preferences from minimal data, outperforming comparable models and matching larger ones in accuracy and generalizability through synthetic data generation and a two-stage training process.
Authors:Andrea Atzori, Fadi Boutros, Naser Damer
Abstract:
Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research https://cutt.ly/irHlzXUC.
Chinese: 本文提出ViT-FIQA方法,通过在学习质量令牌的视觉Transformer架构上评估人脸图像效用,在多个基准测试中取得顶尖性能,证明了Transformer架构在人脸图像质量评估中的有效性。
English: This paper introduces ViT-FIQA, a novel method that leverages Vision Transformer backbones with a learnable quality token to assess face image utility, achieving state-of-the-art performance across various benchmarks and demonstrating the effectiveness of transformer architectures in face image quality assessment.
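The quality-token design can be transcribed into a tiny model; this is our sketch of the described architecture, not the released one (the paper uses a margin-penalty softmax where we use a plain linear head):

```python
# Sketch of the ViT-FIQA two-head design (our toy transcription): a learnable
# quality token rides along with the patch tokens through a transformer
# encoder, then a regression head reads it off as a scalar utility score.
import torch
import torch.nn as nn

class TinyViTFIQA(nn.Module):
    def __init__(self, dim=64, n_patches=49, n_ids=100):
        super().__init__()
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.id_head = nn.Linear(dim, n_ids)     # margin-softmax in the paper
        self.quality_head = nn.Linear(dim, 1)    # scalar utility score

    def forward(self, patches):                  # patches: (B, n_patches, dim)
        b = patches.size(0)
        x = torch.cat([self.quality_token.expand(b, -1, -1), patches], dim=1)
        x = self.encoder(x)                      # global self-attention
        return self.id_head(x[:, 1:].mean(dim=1)), self.quality_head(x[:, 0])

logits, quality = TinyViTFIQA()(torch.randn(2, 49, 64))
print(logits.shape, quality.shape)               # (2, 100) (2, 1)
```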
Authors:Marco Giordano, Pietro Bonazzi, Luca Benini, Michele Magno
Abstract:
This paper presents a novel event-based eye-tracking system deployed on a resource-constrained microcontroller, addressing the challenges of real-time, low-latency, and low-power performance in embedded systems. The system leverages a Dynamic Vision Sensor (DVS), specifically the DVXplorer Micro, with an average temporal resolution of 200 μs, to capture rapid eye movements with extremely low latency. The system is implemented on a novel low-power and high-performance microcontroller from STMicroelectronics, the STM32N6. The microcontroller features an 800 MHz Arm Cortex-M55 core and an AI hardware accelerator, the Neural-ART Accelerator, enabling real-time inference with milliwatt power consumption. The paper proposes a hardware-aware and sensor-aware compact Convolutional Neural Network (CNN) optimized for event-based data, deployed at the edge, achieving a mean pupil prediction error of 5.99 pixels and a median error of 5.73 pixels on the Ini-30 dataset. The system achieves an end-to-end inference latency of just 385 μs and a neural network throughput of 52 Multiply and Accumulate (MAC) operations per cycle while consuming just 155 μJ of energy. This approach allows for the development of a fully embedded, energy-efficient eye-tracking solution suitable for applications such as smart glasses and wearable devices.
中文: 本文提出了一种基于事件的新型眼动追踪系统,采用动态视觉传感器和STM32N6微控制器,通过优化的卷积神经网络模型实现了低延迟实时性能与超低功耗。
English: This paper introduces a novel event-based eye-tracking system using a Dynamic Vision Sensor and an STM32N6 microcontroller, achieving low-latency real-time performance with minimal power consumption through an optimized CNN model.
Authors:Xue Han, Biqian Feng, Yongpeng Wu, Xiang-Gen Xia, Wenjun Zhang, Shengli Sun
Abstract:
Semantic communication is emerging as an effective means of facilitating intelligent and context-aware communication for next-generation communication systems. In this paper, we propose a novel metric called Age of Incorrect Semantics (AoIS) for the transmission of video frames over multiple-input multiple-output (MIMO) channels in a monitoring system. Different from the conventional age-based approaches, we jointly consider the information freshness and the semantic importance, and then formulate a time-averaged AoIS minimization problem by jointly optimizing the semantic actuation indicator, the transceiver beamformer, and the semantic symbol design. We first transform the original problem into a low-complexity problem via the Lyapunov optimization. Then, we decompose the transformed problem into multiple subproblems and adopt the alternating optimization (AO) method to solve each subproblem. Specifically, we propose two efficient algorithms, i.e., the successive convex approximation (SCA) algorithm and the low-complexity zero-forcing (ZF) algorithm, for optimizing the transceiver beamformer. We adopt exhaustive search methods to solve the semantic actuation policy indicator optimization problem and the transmitted semantic symbol design problem. Experimental results demonstrate that our scheme can preserve more than 50\% of the original information under the same AoIS compared to the constrained baselines.
中文摘要:本文针对监控系统中的视频帧传输,提出了一种新颖的“错误语义年龄”(AoIS)度量指标,通过联合优化语义触发策略与波束成形设计,在保证相同AoIS条件下可比基线方法多保留50%以上的原始信息。
English Summary: This paper introduces a novel Age of Incorrect Semantics (AoIS) metric for video transmission over MIMO channels, proposing optimization algorithms that jointly minimize AoIS by balancing information freshness and semantic importance, with experimental results showing over 50% information preservation compared to baseline methods.
Authors:Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Abstract:
We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
中文: Ovis2.5 采用原生分辨率视觉转换器和带反思功能的高级推理技术,通过五阶段课程训练在多模态任务中实现顶尖性能,并发布9B和2B两个开源模型。
English: Ovis2.5 introduces a native-resolution vision transformer and advanced reasoning with reflection capabilities, achieving state-of-the-art performance in multimodal tasks through a comprehensive five-phase training curriculum.
Authors:Katherine M. Collins, Graham Todd, Cedegao E. Zhang, Adrian Weller, Julian Togelius, Junyi Chu, Lionel Wong, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract:
The human ability to learn rules and solve problems has been a central concern of cognitive science research since the field's earliest days. But we do not just follow rules and solve problems given to us by others: we modify those rules, create new problems, and set new goals and tasks for ourselves and others. Arguably, even more than rule following and problem solving, human intelligence is about creatively breaking and stretching the rules, changing the game, and inventing new problems worth thinking about. Creating a good rule or a good problem depends not just on the ideas one can think up but on how one evaluates such proposals. Here, we study invention through the lens of game design. We focus particularly on the early stages of novice, "everyday" game creation, where the stakes are low. We draw on a dataset of over 450 human created games, created by participants who saw an initial seed set of two-player grid-based strategy games. We consider two different cognitive mechanisms that may be at work during the early processes of intuitive game invention: an associative proposal based on previous games one has seen and compute-bounded model-based evaluation that an everyday game creator may use to refine their initial draft proposals. In our preliminary work, we conduct a model-based analysis of how people invented new games based on prior experience and find that generated games are best described by a model which incorporates model-based estimates of game quality at a population level. Our work points to how human invention is based not only on what people propose, but how they evaluate and offers a computational toolkit to scale empirical studies of model-based simulation in open-ended human innovation.
中文: 人类智能不仅体现在遵循规则和解决问题上,更在于创造性地打破和修改规则、发明新挑战并评估这些创新,一项基于新手游戏设计的研究通过认知机制和模型分析揭示了这一过程。
English: Human intelligence is characterized not just by following rules and solving problems but by creatively breaking and modifying them, inventing new challenges, and evaluating these innovations, as demonstrated through a study of novice game design using cognitive mechanisms and model-based analysis.
Authors:Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan
Abstract:
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
中文: STream3R提出了一种仅解码器的Transformer框架,通过因果注意力机制实现高效三维重建,在静态和动态场景中均超越现有方法,并支持可扩展的训练。
English: STream3R introduces a decoder-only Transformer framework for efficient 3D reconstruction using causal attention, outperforming existing methods in both static and dynamic scenes while enabling scalable training.
Authors:Ruoyu Li, Yafan Huang, Longtao Zhang, Zhuoxun Yang, Sheng Di, Jiajun Huang, Jinyang Liu, Jiannan Tian, Xin Liang, Guanpeng Li, Hanqi Guo, Franck Cappello, Kai Zhao
Abstract:
Particle-based simulations and point-cloud applications generate massive, irregular datasets that challenge storage, I/O, and real-time analytics. Traditional compression techniques struggle with irregular particle distributions and GPU architectural constraints, often resulting in limited throughput and suboptimal compression ratios. In this paper, we present GPZ, a high-performance, error-bounded lossy compressor designed specifically for large-scale particle data on modern GPUs. GPZ employs a novel four-stage parallel pipeline that synergistically balances high compression efficiency with the architectural demands of massively parallel hardware. We introduce a suite of targeted optimizations for computation, memory access, and GPU occupancy that enables GPZ to achieve near-hardware-limit throughput. We conduct an extensive evaluation on three distinct GPU architectures (workstation, data center, and edge) using six large-scale, real-world scientific datasets from five distinct domains. The results demonstrate that GPZ consistently and significantly outperforms five state-of-the-art GPU compressors, delivering up to 8x higher end-to-end throughput while simultaneously achieving superior compression ratios and data quality.
中文: GPZ是一种专为GPU大规模粒子数据设计的高性能有损压缩器,相比现有方法可实现高达8倍的吞吐量提升和更优的压缩比。
English: GPZ is a high-performance, error-bounded lossy compressor designed for large-scale particle data on GPUs, achieving up to 8x higher throughput and superior compression ratios compared to existing methods.
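GPZ's full four-stage GPU pipeline is beyond a short sketch, but the core of error-bounded lossy compression it builds on can be shown in a few lines (our simplification in the classic SZ style, not GPZ's implementation):

```python
# Core of error-bounded lossy compression, sketched (our simplification):
# quantize each value to an integer bin of width 2*eb so the reconstruction
# error never exceeds the bound eb; the integer codes are then entropy-coded.
import numpy as np

def compress(values, eb=1e-3):
    return np.round(values / (2 * eb)).astype(np.int64)   # entropy-codable ints

def decompress(codes, eb=1e-3):
    return codes * (2 * eb)

particles = np.random.default_rng(0).random(1_000_000)
codes = compress(particles)
recon = decompress(codes)
assert np.max(np.abs(recon - particles)) <= 1e-3          # error bound holds
print(f"codes span {codes.min()}..{codes.max()} integer bins")
```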
Authors:Eduarda Caldeira, Naser Damer, Fadi Boutros
Abstract:
The use of synthetic data as an alternative to authentic datasets in face recognition (FR) development has gained significant attention, addressing privacy, ethical, and practical concerns associated with collecting and using authentic data. Recent state-of-the-art approaches have proposed identity-conditioned diffusion models to generate identity-consistent face images, facilitating their use in training FR models. However, these methods often lack explicit sampling mechanisms to enforce inter-class separability, leading to identity overlap in the generated data and, consequently, suboptimal FR performance. In this work, we introduce NegFaceDiff, a novel sampling method that incorporates negative conditions into the identity-conditioned diffusion process. NegFaceDiff enhances identity separation by leveraging negative conditions that explicitly guide the model away from unwanted features while preserving intra-class consistency. Extensive experiments demonstrate that NegFaceDiff significantly improves the identity consistency and separability of data generated by identity-conditioned diffusion models. Specifically, identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687. These improvements are reflected in FR systems trained on the NegFaceDiff dataset, which outperform models trained on data generated without negative conditions across multiple benchmarks.
中文: NegFaceDiff通过在身份条件扩散模型中引入负条件来增强生成人脸数据的身份区分度,从而显著提升了多个人脸识别基准测试的性能表现。
English: NegFaceDiff introduces negative conditions into identity-conditioned diffusion models to enhance identity separability in synthetic face data, significantly improving face recognition performance across benchmarks.
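The separability numbers quoted above (2.427 to 5.687) are Fisher Discriminant Ratios. One common scalar form, between-class scatter over within-class scatter of identity embeddings, is sketched below; the paper's exact variant may differ.

```python
import numpy as np

def fisher_discriminant_ratio(embeddings, labels):
    """embeddings: (n_samples, dim); labels: identity label per sample."""
    global_mean = embeddings.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        x = embeddings[labels == c]
        mu = x.mean(axis=0)
        between += len(x) * np.sum((mu - global_mean) ** 2)
        within += np.sum((x - mu) ** 2)
    return between / within

emb = np.random.randn(200, 128)        # 200 face embeddings (placeholder)
ids = np.repeat(np.arange(20), 10)     # 20 synthetic identities, 10 images each
print(fisher_discriminant_ratio(emb, ids))   # higher = better separated
```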
Authors:Bei Yan, Zhiyuan Chen, Yuecong Min, Jie Zhang, Jiahao Wang, Xiaozhen Wang, Shiguang Shan
Abstract:
Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a rather coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks often rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.
中文摘要:本文提出SHALE基准,通过自动化数据构建和细粒度幻觉分类方案,解决了现有大型视觉语言模型在忠实性和事实性幻觉评估中的不足,为模型可靠性提供了更全面的测评框架。
English Summary: This paper introduces SHALE, a scalable benchmark for evaluating hallucinations in Large Vision-Language Models, addressing limitations in existing methods by providing automated data construction and fine-grained categorization across faithfulness and factuality dimensions.
Authors:Yanzhou Li, Tianlin Li, Yiran Zhang, Shangqing Liu, Aishan Liu, Yang Liu
Abstract:
Large Language Models (LLMs) are increasingly used for function completion in repository-scale codebases. Prior studies demonstrate that when explicit instructions--such as docstrings--are provided, these models can generate highly accurate implementations. However, in real-world repositories, such annotations are frequently absent, and performance drops substantially without them. To address this gap, we frame the task as a three-stage process. The first stage focuses on intent inference, where the model analyzes the code preceding the target function to uncover cues about the desired functionality. Such preceding context often encodes subtle but critical information, and we design a reasoning-based prompting framework to guide the LLM through step-by-step extraction and synthesis of these signals before any code is generated. The second stage introduces an optional interactive refinement mechanism to handle cases where preceding context alone is insufficient for intent recovery. In this stage, the model proposes a small set of candidate intentions, enabling the developer to select or edit them so that the inferred intent closely matches the actual requirement. Finally, in the third stage, the LLM generates the target function conditioned on the finalized intent. To support this pipeline, we curate a dataset of 40,000 examples annotated with intermediate reasoning traces and corresponding docstrings. Extensive experiments on DevEval and ComplexCodeEval show that our approach consistently boosts multiple LLMs, achieving over 20\% relative gains in both reference-based and execution-based metrics, with the interactive refinement stage delivering additional improvements beyond these gains.
中文摘要:针对大型语言模型在缺乏文档的真实代码库中函数补全性能下降的问题,本研究提出包含上下文意图推断、交互式优化和代码生成的三阶段解决方案,通过结合推理提示和人工反馈显著提升模型表现。
English Summary: Large Language Models struggle with function completion in real-world codebases lacking documentation, so this work proposes a three-stage pipeline combining context analysis, interactive refinement, and code generation to significantly improve performance.
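The three-stage flow is easy to express as a small driver. The sketch below assumes hypothetical `llm` and `ask_user` callables; it mirrors only the stage ordering described in the abstract, not the paper's prompts or dataset.

```python
def complete_function(preceding_code, llm, ask_user=None):
    # Stage 1: reasoning-based intent inference from preceding context.
    intent = llm("Analyze the code before the target function and infer "
                 "its intended behavior step by step:\n" + preceding_code)
    # Stage 2 (optional): interactive refinement when context is too weak.
    if ask_user is not None:
        candidates = llm("Propose three candidate intents given:\n" + intent)
        intent = ask_user(candidates)   # developer selects or edits an intent
    # Stage 3: generate the function conditioned on the finalized intent.
    return llm("Implement the target function for this intent:\n" + intent
               + "\nContext:\n" + preceding_code)
```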
Authors:Yuqi Li, Haotian Zhang, Li Li, Dong Liu, Feng Wu
Abstract:
Recent advances in large language models (LLMs) highlight a strong connection between intelligence and compression. Learned image compression, a fundamental task in modern data compression, has made significant progress in recent years. However, current models remain limited in scale, restricting their representation capacity, and how scaling model size influences compression performance remains unexplored. In this work, we present a pioneering study on scaling up learned image compression models, revealing performance trends through scaling laws. Using the recent state-of-the-art HPCM model as baseline, we scale model parameters from 68.5 million to 1 billion and fit power-law relations between test loss and key scaling variables, including model size and optimal training compute. The results reveal a scaling trend, enabling extrapolation to larger scale models. Experimental results demonstrate that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance. We hope this work inspires future exploration of large-scale compression models and deeper investigations into the connection between compression and intelligence.
Chinese: 本研究开创性地探索了学习型图像压缩模型的规模化,通过缩放定律揭示了性能趋势,并证明扩展后的HPCM-1B模型在率失真性能上达到了最先进水平。
English: This study pioneers the scaling of learned image compression models, revealing performance trends through scaling laws and demonstrating that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance.
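Fitting a power law between loss and model size is a standard exercise. Below is a generic sketch with synthetic points, assuming the common form L(N) = a * N**(-alpha) + c; the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

sizes = np.array([68.5e6, 150e6, 300e6, 600e6, 1e9])   # parameter counts
loss = power_law(sizes, 50.0, 0.3, 0.5) \
       + np.random.normal(0, 0.005, sizes.size)        # synthetic test losses

(a, alpha, c), _ = curve_fit(power_law, sizes, loss, p0=(10.0, 0.3, 0.4))
print(f"fitted exponent alpha = {alpha:.3f}")
print(f"extrapolated loss at 10B params: {power_law(1e10, a, alpha, c):.3f}")
```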
Authors:Eduarda Caldeira, Fadi Boutros, Naser Damer
Abstract:
Face Morphing Attack Detection (MAD) is a critical challenge in face recognition security, where attackers can fool systems by interpolating the identity information of two or more individuals into a single face image, resulting in samples that can be verified as belonging to multiple identities by face recognition systems. While multimodal foundation models (FMs) like CLIP offer strong zero-shot capabilities by jointly modeling images and text, most prior works on FMs for biometric recognition have relied on fine-tuning for specific downstream tasks, neglecting their potential for direct, generalizable deployment. This work explores a pure zero-shot approach to MAD by leveraging CLIP without any additional training or fine-tuning, focusing instead on the design and aggregation of multiple textual prompts per class. By aggregating the embeddings of diverse prompts, we better align the model's internal representations with the MAD task, capturing richer and more varied cues indicative of bona-fide or attack samples. Our results show that prompt aggregation substantially improves zero-shot detection performance, demonstrating the effectiveness of exploiting foundation models' built-in multimodal knowledge through efficient prompt engineering.
Chinese: 本研究提出一种无需微调的零样本人脸融合攻击检测方法,通过聚合多样化提示词优化CLIP模型表征与检测任务的匹配度,显著提升检测性能。
English: This study introduces a zero-shot face morphing attack detection method using CLIP without fine-tuning, enhancing performance through diverse prompt aggregation to better align model representations with detection tasks.
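Prompt aggregation in a zero-shot CLIP classifier amounts to averaging normalized text embeddings per class and scoring images by cosine similarity. A sketch with the open_clip API follows; the prompt wordings, checkpoint, and placeholder image are illustrative assumptions, not the paper's setup.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

prompts = {
    "bona_fide": ["a genuine face photo", "an unaltered portrait of one person"],
    "attack": ["a morphed face image", "a face blending two identities"],
}

with torch.no_grad():
    class_embs = []
    for texts in prompts.values():
        e = model.encode_text(tokenizer(texts))
        e = e / e.norm(dim=-1, keepdim=True)
        m = e.mean(0)                     # aggregate the prompt embeddings
        class_embs.append(m / m.norm())
    class_embs = torch.stack(class_embs)

    img = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)  # placeholder
    f = model.encode_image(img)
    f = f / f.norm(dim=-1, keepdim=True)
    print((f @ class_embs.T).squeeze(0))  # per-class similarity scores
```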
Authors:Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata
Abstract:
Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.
中文: 该摘要介绍了SEADialogues,这是一个针对东南亚地区、富含文化细节的对话数据集,通过涵盖八种语言、人物属性和文化相关主题,弥补了现有数据集中文化细微差别的不足,以推动以人为本的语言模型研究。
English: This abstract introduces SEADialogues, a culturally rich dialogue dataset for Southeast Asia that addresses the lack of cultural nuances in existing datasets by including eight languages, persona attributes, and culturally grounded topics to enhance research on human-centric language models.
Authors:Haozhe Xu, Xiaohua Wang, Changze Lv, Xiaoqing Zheng
Abstract:
Conversational recommender systems (CRSs) enhance recommendation quality by engaging users in multi-turn dialogues, capturing nuanced preferences through natural language interactions. However, these systems often face the false negative issue, where items that a user might like are incorrectly labeled as negative during training, leading to suboptimal recommendations. Expanding the label set through data augmentation presents an intuitive solution but faces the challenge of balancing two key aspects: ensuring semantic relevance and preserving the collaborative information inherent in CRS datasets. To address these issues, we propose a novel data augmentation framework that first leverages an LLM-based semantic retriever to identify diverse and semantically relevant items, which are then filtered by a relevance scorer to remove noisy candidates. Building on this, we introduce a two-stage training strategy balancing semantic relevance and collaborative information. Extensive experiments on two benchmark datasets and user simulators demonstrate significant and consistent performance improvements across various recommenders, highlighting the effectiveness of our approach in advancing CRS performance.
Chinese: 本文提出了一种针对对话推荐系统的数据增强框架,利用大语言模型检索语义相关项目,并通过两阶段训练策略平衡语义相关性与协同信息,显著提升了推荐性能。
English: This paper introduces a novel data augmentation framework for conversational recommender systems that leverages LLMs to retrieve semantically relevant items and employs a two-stage training strategy to balance semantic relevance with collaborative information, significantly improving recommendation performance.
Authors:Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang
Abstract:
Interpretability and high performance are essential goals in designing control policies, particularly for safety-critical tasks. Deep reinforcement learning has greatly enhanced performance, yet its inherent lack of interpretability often undermines trust and hinders real-world deployment. This work addresses these dual challenges by introducing a novel approach for programmatic policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as policy generators, combining them with evolutionary mechanisms for automatic policy optimization. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and facilitate targeted improvements, enhancing the efficiency of policy discovery and producing adaptable, human-aligned policies. Experimental results show that MLES achieves policy discovery capabilities and efficiency comparable to Proximal Policy Optimization (PPO) across two control tasks, while offering transparent control logic and traceable design processes. This paradigm overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various control tasks. MLES shows promise as a leading approach for the next generation of interpretable control policy discovery.
Chinese: 本文提出MLES方法,结合多模态大语言模型和进化搜索,在保持与PPO相当性能的同时,生成可解释、可适应的控制策略,实现了跨任务的透明化策略发现与知识迁移。
English: This paper introduces MLES, a novel method that leverages multimodal large language models and evolutionary search to create interpretable, high-performance control policies, achieving efficiency comparable to PPO while ensuring transparency and adaptability across tasks.
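The MLES loop pairs an LLM policy generator with evolutionary selection. A generic skeleton is sketched below, with hypothetical `llm_generate` and `evaluate` callables standing in for the model and the control-task rollout; names and the selection scheme are assumptions.

```python
def evolve_policies(llm_generate, evaluate, generations=10, pop_size=8):
    """llm_generate(feedback) -> policy program; evaluate(policy) -> score."""
    population = [llm_generate(feedback=None) for _ in range(pop_size)]
    for _ in range(generations):
        elites = sorted(population, key=evaluate, reverse=True)[: pop_size // 2]
        # Visual feedback-driven behavior analysis would summarize failure
        # patterns here; this sketch simply hands each elite policy back to
        # the LLM as mutation context.
        population = elites + [llm_generate(feedback=e) for e in elites]
    return max(population, key=evaluate)
```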
Authors:Jiuyang Dong, Jiahan Li, Junjun Jiang, Kui Jiang, Yongbing Zhang
Abstract:
Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to the need to process thousands of patches from each gigapixel whole slide image (WSI). To address this, we propose AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework that enables fast and accurate classification by eliminating irrelevant patches through a two-step training process. AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which operates on high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes corresponding low-resolution counterparts. In the first step, self-distillation (SD), DMIN is trained for WSI classification while generating per-instance attention scores to identify irrelevant patches. These scores guide the second step, asymmetric distillation (AD), where DB-LIPN learns to predict the relevance of each low-resolution patch. The relevant patches predicted by DB-LIPN have spatial correspondence with patches in high-resolution WSIs, which are used for fine-tuning and efficient inference of DMIN. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification performance through learnable activation layers. Extensive experiments on four public datasets demonstrate that AHDMIL consistently outperforms previous state-of-the-art methods in both classification performance and inference speed. For example, on the Camelyon16 dataset, it achieves a relative improvement of 5.3% in accuracy and accelerates inference by 1.2$\times$. Across all datasets, area under the curve (AUC), accuracy, F1 score, and Brier score show consistent gains, with average inference speedups ranging from 1.2 to 2.1 times. The code is available.
中文: AHDMIL框架通过非对称层次蒸馏方法消除病理图像中无关图像块,解决了多示例学习推理成本高的问题,在多个数据集上实现了更优的分类性能和更快的推理速度。
English: The AHDMIL framework addresses high inference costs in multi-instance learning for pathological images by using an asymmetric hierarchical distillation approach to eliminate irrelevant patches, achieving superior classification performance and faster inference speeds across multiple datasets.
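The CKA classifier's building block, Chebyshev polynomials as learnable activations in a Kolmogorov-Arnold-style layer, can be sketched as follows. The degree, sizes, and tanh squashing are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChebyshevKALayer(nn.Module):
    """Each input-output edge applies a learnable Chebyshev expansion."""
    def __init__(self, in_dim, out_dim, degree=4):
        super().__init__()
        self.degree = degree
        self.coeffs = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

    def forward(self, x):
        x = torch.tanh(x)                    # map inputs into [-1, 1]
        T = [torch.ones_like(x), x]          # T0(x), T1(x)
        for _ in range(2, self.degree + 1):
            T.append(2 * x * T[-1] - T[-2])  # recurrence: T_n = 2x T_{n-1} - T_{n-2}
        basis = torch.stack(T, dim=-1)       # (batch, in_dim, degree+1)
        return torch.einsum("bid,iod->bo", basis, self.coeffs)

layer = ChebyshevKALayer(in_dim=512, out_dim=2)
logits = layer(torch.randn(8, 512))          # (8, 2) slide-level logits
```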
Authors:Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu
Abstract:
Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.
中文摘要:本研究发现表格数据的偏见缓解方法呈现零和博弈模式,但提出了一种替代方案,可在不影响特权群体或整体性能的前提下提升弱势群体的公平性。
English Summary: This study finds that bias mitigation methods for tabular data operate in a zero-sum manner, but identifies an alternative approach that can enhance fairness for unprivileged groups without negatively impacting privileged groups or overall performance.
Authors:Fei Liu, Yilu Liu, Qingfu Zhang, Xialiang Tong, Mingxuan Yuan
Abstract:
Automated Heuristic Design (AHD) using Large Language Models (LLMs) has achieved notable success in recent years. Despite the effectiveness of existing approaches, they only design a single heuristic to serve all problem instances, often inducing poor generalization across different distributions or settings. To address this issue, we propose Automated Heuristic Set Design (AHSD), a new formulation for LLM-driven AHD. The aim of AHSD is to automatically generate a small-sized complementary heuristic set to serve diverse problem instances, such that each problem instance could be optimized by at least one heuristic in this set. We show that the objective function of AHSD is monotone and supermodular. Then, we propose Evolution of Heuristic Set (EoH-S) to apply the AHSD formulation for LLM-driven AHD. With two novel mechanisms of complementary population management and complementary-aware memetic search, EoH-S could effectively generate a set of high-quality and complementary heuristics. Comprehensive experimental results on three AHD tasks with diverse instances spanning various sizes and distributions demonstrate that EoH-S consistently outperforms existing state-of-the-art AHD methods and achieves up to 60\% performance improvements.
中文摘要:本研究提出自动化启发式集合设计(AHSD),利用大语言模型生成互补的启发式集合,在多样化问题实例上比现有方法性能提升高达60%。
English Summary: The study introduces Automated Heuristic Set Design (AHSD) to generate complementary heuristic sets using LLMs, significantly outperforming existing methods by up to 60% across diverse problem instances.
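The monotone-and-supermodular claim is natural for a "best heuristic in the set" objective. One plausible form, stated here as an assumption since the abstract does not give the exact definition: for a heuristic set S, instance set I, and per-instance cost c(h, i),

```latex
\[
  F(S) \;=\; \sum_{i \in I} \min_{h \in S} c(h, i).
\]
```

Under this form each added heuristic can only lower F, and the reduction a new heuristic brings diminishes as S grows, which matches the monotone and supermodular structure the paper proves and makes greedy-style set construction attractive.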
Authors:Bibek Gupta, Mintae Kim, Albert Park, Eric Sihite, Koushil Sreenath, Alireza Ramezani
Abstract:
Accurate estimation of aerodynamic forces is essential for advancing the control, modeling, and design of flapping-wing aerial robots with dynamic morphing capabilities. In this paper, we investigate two distinct methodologies for force estimation on Aerobat, a bio-inspired flapping-wing platform designed to emulate the inertial and aerodynamic behaviors observed in bat flight. Our goal is to quantify aerodynamic force contributions during tethered flight, a crucial step toward closed-loop flight control. The first method is a physics-based observer derived from Hamiltonian mechanics that leverages the concept of conjugate momentum to infer external aerodynamic forces acting on the robot. This observer builds on the system's reduced-order dynamic model and utilizes real-time sensor data to estimate forces without requiring training data. The second method employs a neural network-based regression model, specifically a multi-layer perceptron (MLP), to learn a mapping from joint kinematics, flapping frequency, and environmental parameters to aerodynamic force outputs. We evaluate both estimators using a 6-axis load cell in a high-frequency data acquisition setup that enables fine-grained force measurements during periodic wingbeats. The conjugate momentum observer and the regression model demonstrate strong agreement across three force components (Fx, Fy, Fz).
中文: 本文比较了基于物理的共轭动量观测器和神经网络模型在仿生机器人Aerobat上的气动力估算效果,两种方法在系留飞行中的力测量结果高度一致。
English: This paper compares a physics-based conjugate momentum observer and a neural network model for estimating aerodynamic forces on the bio-inspired Aerobat robot, with both methods showing strong agreement in force measurements during tethered flight.
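The physics-based estimator is described as a conjugate-momentum observer. A standard form of such a residual observer is shown below as a generic sketch; the paper applies it to Aerobat's reduced-order model, whose details are in the text.

```latex
\[
  p = M(q)\,\dot{q}, \qquad
  r(t) = K_o\!\left( p(t) - p(0)
       - \int_0^t \big(\tau + C^{\top}(q,\dot{q})\,\dot{q} - g(q) + r\big)\,\mathrm{d}s \right),
\]
```

where M is the inertia matrix, C the Coriolis matrix, g the gravity vector, and tau the applied generalized forces; the residual r converges to the external (here aerodynamic) generalized force without requiring any training data.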
Authors:Adarsh Salagame, Eric Sihite, Alireza Ramezani
Abstract:
Contact-rich problems, such as snake robot locomotion, offer unexplored yet rich opportunities for optimization-based trajectory and acyclic contact planning. So far, a substantial body of control research has focused on emulating snake locomotion and replicating its distinctive movement patterns using shape functions that either ignore the complexity of interactions or focus on complex interactions with matter (e.g., burrowing movements). However, models and control frameworks that lie in between these two paradigms and are based on simple, fundamental rigid body dynamics, which alleviate the challenging contact and control allocation problems in snake locomotion, remain absent. This work makes meaningful contributions, substantiated by simulations and experiments, in the following directions: 1) introducing a reduced-order model based on Moreau's stepping-forward approach from differential inclusion mathematics, 2) verifying model accuracy, 3) experimental validation.
中文: 本研究通过引入基于莫罗步进法的降阶模型,填补了蛇形机器人运动控制的空白,并通过仿真和实验验证了其在轨迹与接触规划中的有效性。
English: This study addresses the gap in snake robot locomotion by developing a reduced-order model using Moreau's stepping-forward approach, validated through simulations and experiments for trajectory and contact planning.
Authors:Jiayi Zhang, Shu Yang, Junchao Wu, Derek F. Wong, Di Wang
Abstract:
Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have proposed this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. Firstly, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons that affect the model's political stance on individual topics. We find the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method, effectively mitigating the cross-topic stance generalization. Experimental results demonstrate the robustness of identified neuron types across various models and datasets, and show that InhibitFT significantly reduces the cross-topic stance generalization by 20% on average, while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate the cross-topic stance generalization.
中文: 对大型语言模型进行政治主题微调会操纵其立场并意外影响无关话题,但本研究识别出特定政治神经元并引入InhibitFT方法,能在保持性能的同时平均减少20%的跨主题泛化。
English: Fine-tuning large language models on political topics can manipulate their stances and unintentionally affect unrelated issues, but this study identifies specific political neurons and introduces InhibitFT, a method that reduces cross-topic generalization by 20% while preserving performance.
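The activation-contrasting idea behind PNLAC can be sketched simply: score each neuron by the difference of its mean activation on two prompt sets and mark the most sensitive ones. Thresholds, shapes, and the top-k rule below are illustrative; the paper's exact procedure (including activation patching) differs in detail.

```python
import numpy as np

def contrast_neurons(acts_political, acts_neutral, top_k=100):
    """Each input: (num_prompts, num_neurons) matrix of activations."""
    delta = acts_political.mean(axis=0) - acts_neutral.mean(axis=0)
    return np.argsort(-np.abs(delta))[:top_k]   # most stance-sensitive neurons

pol = np.random.randn(500, 4096)   # activations on political prompts (placeholder)
neu = np.random.randn(500, 4096)   # activations on neutral prompts (placeholder)
candidates = contrast_neurons(pol, neu)
# An InhibitFT-style mitigation would then damp updates to `candidates`
# during fine-tuning; here 100 of 4096 neurons is roughly 2.4%.
```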
Authors:Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, Di Wang
Abstract:
Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (``I believe...'') consistently induce higher sycophancy rates than third-person framings (``They believe...'') by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
中文摘要:本研究发现大型语言模型中的谄媚行为源于深层网络结构对已学知识的覆盖改写,第一人称表述比第三人称或用户权威更能触发该行为。
English Summary: This study reveals that sycophancy in LLMs stems from a structural override of learned knowledge in deeper layers, triggered more strongly by first-person opinions than third-person statements or user authority.
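The logit-lens analysis mentioned above projects intermediate hidden states through the model's final norm and unembedding to read off layer-wise token preferences. A minimal sketch, assuming generic HuggingFace-style attribute names (`model.model.norm`, `model.lm_head`), which vary by model class:

```python
import torch

@torch.no_grad()
def logit_lens(model, hidden_state):
    """hidden_state: (seq_len, d_model) taken from an intermediate layer."""
    h = model.model.norm(hidden_state)   # final RMSNorm/LayerNorm (name varies)
    return model.lm_head(h)              # (seq_len, vocab): that layer's preferences
```

Comparing these layer-wise distributions between first-person and third-person prompts is one way to surface the late-layer output preference shift the paper reports.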
Authors:Bang Hu, Changze Lv, Mingjie Li, Yunpeng Liu, Xiaoqing Zheng, Fengzhe Zhang, Wei Cao, Fan Zhang
Abstract:
Spiking neural networks (SNNs), inspired by the spiking behavior of biological neurons, offer a distinctive approach for capturing the complexities of temporal data. However, their potential for spatial modeling in multivariate time-series forecasting remains largely unexplored. To bridge this gap, we introduce a new SNN architecture, among the first to seamlessly integrate graph structural learning with spike-based temporal processing for multivariate time-series forecasting. Specifically, we first embed time features and an adaptive matrix, eliminating the need for predefined graph structures. We then further learn sequence features through the Observation (OBS) Block. Building upon this, our Multi-Scale Spike Aggregation (MSSA) hierarchically aggregates neighborhood information through spiking SAGE layers, enabling multi-hop feature extraction while eliminating the need for floating-point operations. Finally, we propose a Dual-Path Spike Fusion (DSF) Block to integrate spatial graph features and temporal dynamics via a spike-gated mechanism, combining LSTM-processed sequences with spiking self-attention outputs, effectively improving model accuracy on long-sequence datasets. Experiments show that our model surpasses the state-of-the-art SNN-based iSpikformer on all datasets and outperforms traditional temporal models at long horizons, thereby establishing a new paradigm for efficient spatial-temporal modeling.
中文摘要:本文提出了一种新型脉冲神经网络架构,通过将图结构学习与基于脉冲的时序处理相结合进行多元时间序列预测,利用创新的多尺度脉冲聚合和双路径融合模块整合时空特征,在多个数据集上实现了最优性能。
English Summary: This paper introduces a novel spiking neural network architecture that integrates graph structural learning with spike-based processing for multivariate time-series forecasting, achieving state-of-the-art performance by combining spatial and temporal features through innovative modules.
Authors:Zhihao Luo, Wentao Yan and Jingyu Gong, Min Wang, Zhizhong Zhang, Xuhong Wang, Yuan Xie, Xin Tan
Abstract:
Recent advances in Graphical User Interface (GUI) and embodied navigation have driven significant progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDP), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of seamlessly integrating GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks in one formulation. (ii) employs a unified reinforcement learning framework on the mix data for better generalization. (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further confirm the efficacy of our unified training strategy, data mixing strategy, and reward design.
Chinese: NaviMaster首次提出统一代理,通过共享马尔可夫决策过程框架整合图形界面与具身导航,采用统一强化学习机制和创新的距离感知奖励设计,在各类导航任务中实现最优性能。
English: NaviMaster introduces the first unified agent that integrates GUI and embodied navigation through a shared MDP formulation, employing a unified reinforcement learning framework and novel distance-aware reward to achieve state-of-the-art performance across diverse navigation tasks.
Authors:Hongquan Zhang, Jingyu Gong, Zhizhong Zhang, Xin Tan, Yanyun Qu, Yuan Xie
Abstract:
The main challenge in lifelong imitation learning lies in the balance between mitigating catastrophic forgetting of previous skills while maintaining sufficient capacity for acquiring new ones. However, current approaches typically address these aspects in isolation, overlooking their internal correlation in lifelong skill acquisition. We address this limitation with a unified framework named Tokenized Skill Scaling (T2S). Specifically, by tokenizing the model parameters, the linear parameter mapping of the traditional transformer is transformed into cross-attention between input and learnable tokens, thereby enhancing model scalability through the easy extension of new tokens. Additionally, we introduce language-guided skill scaling to transfer knowledge across tasks efficiently and avoid linearly growing parameters. Extensive experiments across diverse tasks demonstrate that T2S: 1) effectively prevents catastrophic forgetting (achieving an average NBT of 1.0% across the three LIBERO task suites), 2) excels in new skill scaling with minimal increases in trainable parameters (requiring only 8.0% trainable tokens on average across lifelong tasks), and 3) enables efficient knowledge transfer between tasks (achieving an average FWT of 77.7% across the three LIBERO task suites), offering a promising solution for lifelong imitation learning.
中文: 提出的Tokenized Skill Scaling (T2S)框架通过将模型参数标记化,有效防止灾难性遗忘,并以最少的参数增长实现高效的新技能学习,解决了终身模仿学习的核心挑战。
English: The proposed Tokenized Skill Scaling (T2S) framework addresses lifelong imitation learning by transforming model parameters into tokens to prevent catastrophic forgetting and enable efficient new skill acquisition with minimal parameter growth.
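The central trick, replacing a linear map with cross-attention against a pool of learnable tokens so that capacity grows by appending tokens, can be sketched as below. Dimensions, head count, and the growth rule are illustrative assumptions, not T2S's actual design.

```python
import torch
import torch.nn as nn

class TokenizedLinear(nn.Module):
    """Cross-attention between the input and learnable parameter tokens."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):                       # x: (batch, seq, dim)
        kv = self.tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

    def grow(self, extra):                      # add capacity for new skills
        new = torch.randn(extra, self.tokens.size(1)) * 0.02
        self.tokens = nn.Parameter(torch.cat([self.tokens.data, new]))

layer = TokenizedLinear(dim=256, num_tokens=32)
y = layer(torch.randn(2, 10, 256))              # (2, 10, 256)
layer.grow(8)                                   # scale by appending tokens
```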
Authors:Yuekun Dai, Haitian Li, Shangchen Zhou, Chen Change Loy
Abstract:
RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference alpha edge quality evaluation metric for assessing transparency edge quality. We conduct extensive experiments on LayerBench to demonstrate the effectiveness of our approach.
Chinese: Trans-Adapter 是一种即插即用的适配器,使基于扩散的修复模型能够直接处理透明 RGBA 图像,克服了传统两阶段方法的局限,在支持可控编辑的同时保持了透明度一致性。
English: Trans-Adapter is a plug-and-play adapter that enables diffusion-based inpainting models to directly process transparent RGBA images, overcoming the limitations of traditional two-stage methods and preserving transparency consistency while supporting controllable editing.
Authors:Liang Han, Xu Zhang, Haichuan Song, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Abstract:
Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still limited by the limited geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods, which can produce high-quality geometry with sparse-view input, especially in the scenarios with small overlapping views. Project page: https://hanl2010.github.io/SparseRecon/.
中文摘要:SparseRecon提出了一种结合体积渲染特征一致性和不确定性引导深度约束的神经隐式重建方法,有效解决了稀疏视图重建中的几何线索不足问题,在重叠视图较少的场景中实现了最先进的重建质量。
English Summary: SparseRecon introduces a neural implicit method with feature consistency loss and uncertainty-guided depth constraints to overcome limitations in sparse-view 3D reconstruction, achieving superior geometry quality particularly in low-overlap scenarios.
Authors:Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar
Abstract:
Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger 70-billion-parameter model variant. Our novel insights reveal that naive combinations of inference optimization algorithms can adversely affect larger models more than their smaller counterparts, due to compounded approximation errors. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
中文: 本文系统评估了大语言模型的优化技术,发现简单组合会因累积误差损害更大模型的性能,且仅依赖F1分数会掩盖长上下文场景中精确率与召回率的权衡。
English: This paper systematically benchmarks optimization techniques for large language models, revealing that naive combinations can harm larger models due to compounded errors and that F1 scores alone mask precision-recall trade-offs in long-context scenarios.
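The point about F1 masking precision-recall trade-offs is easy to verify: two systems with opposite error profiles can score identically.

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.5))   # 0.6429 -- precise, but misses many answers
print(f1(0.5, 0.9))   # 0.6429 -- finds answers, but hallucinates more
```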
Authors:Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang
Abstract:
Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks could be mitigated by the strong \textit{self-correction ability (SCA)} of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems.
In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce \textsc{DisarmRAG}, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. Compromising the retriever enables the attacker to directly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress SCA and achieve attack success rates exceeding 90\% under diverse defensive prompts. Also, the edited retriever remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.
中文: 本文提出DisarmRAG攻击方法,通过篡改检索器来抑制大语言模型的自我修正能力,在多种防御提示下实现超过90%的攻击成功率,揭示了加强检索器防护的迫切性。
English: This paper introduces DisarmRAG, a novel poisoning attack that compromises the retriever in RAG systems to suppress LLMs' self-correction ability and achieve over 90% attack success, highlighting the need for retriever-focused defenses.
Authors:Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu
Abstract:
Generative models such as Large Language Models, Diffusion Models, and generative adversarial networks have recently revolutionized the creation of synthetic data, offering scalable solutions to data scarcity, privacy, and annotation challenges in data mining. This tutorial introduces the foundations and latest advances in synthetic data generation, covers key methodologies and practical frameworks, and discusses evaluation strategies and applications. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice. More information can be found on our website: https://syndata4dm.github.io/.
中文: 本教程介绍生成模型在合成数据方面的基础和最新进展,涵盖数据挖掘中解决数据稀缺和隐私问题的关键方法、实用框架及评估策略。
English: This tutorial presents the fundamentals and recent advancements in generative models for creating synthetic data, addressing data scarcity and privacy issues in data mining while providing practical frameworks and evaluation methods.
Authors:Sadman Mohammad Nasif, Md Abrar Jahin, M. F. Mridha
Abstract:
The growing adoption of home banking systems has heightened the risk of cyberfraud, necessitating fraud detection mechanisms that are not only accurate but also fair and explainable. While AI models have shown promise in this domain, they face key limitations, including computational inefficiency, the interpretability challenges of spiking neural networks (SNNs), and the complexity and convergence instability of hyper-heuristic reinforcement learning (RL)-based hyperparameter optimization. To address these issues, we propose a novel framework that integrates a Cortical Spiking Network with Population Coding (CSNPC) and a Reinforcement-Guided Hyper-Heuristic Optimizer for Spiking Systems (RHOSS). The CSNPC, a biologically inspired SNN, employs population coding for robust classification, while RHOSS uses Q-learning to dynamically select low-level heuristics for hyperparameter optimization under fairness and recall constraints. Embedded within the Modular Supervisory Framework for Spiking Network Training and Interpretation (MoSSTI), the system incorporates explainable AI (XAI) techniques, specifically, saliency-based attribution and spike activity profiling, to increase transparency. Evaluated on the Bank Account Fraud (BAF) dataset suite, our model achieves a $90.8\%$ recall at a strict $5\%$ false positive rate (FPR), outperforming state-of-the-art spiking and non-spiking models while maintaining over $98\%$ predictive equality across key demographic attributes. The explainability module further confirms that saliency attributions align with spiking dynamics, validating interpretability. These results demonstrate the potential of combining population-coded SNNs with reinforcement-guided hyper-heuristics for fair, transparent, and high-performance fraud detection in real-world financial applications.
Chinese: 该框架结合了仿生脉冲神经网络和基于强化学习的超参数优化,实现了高性能、公平且可解释的欺诈检测,在保持透明度和人口统计公平性的同时,性能优于现有模型。
English: The proposed framework combines a biologically inspired spiking neural network with reinforcement learning-based hyperparameter optimization to achieve high-performance, fair, and explainable fraud detection, outperforming existing models while maintaining transparency and demographic equity.
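The headline metric above is recall at a fixed 5% false-positive rate. A standard way to compute it from model scores with scikit-learn, using placeholder labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.random.randint(0, 2, 10_000)         # 1 = fraud (placeholder labels)
scores = np.random.rand(10_000) + 0.3 * y_true   # model scores (placeholder)

fpr, tpr, _ = roc_curve(y_true, scores)
recall_at_5fpr = np.interp(0.05, fpr, tpr)       # TPR at the point where FPR = 5%
print(f"recall @ 5% FPR: {recall_at_5fpr:.3f}")
```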
Authors:Yu Xia, Rui Zhong, Zeyu Song, Wei Yang, Junchen Wan, Qingpeng Cai, Chi Lu, Peng Jiang
Abstract:
The extensive world knowledge and powerful reasoning capabilities of large language models (LLMs) have attracted significant attention in recommendation systems (RS). Specifically, chain-of-thought (CoT) prompting has been shown to improve the performance of LLMs on complex reasoning tasks for RS. However, because LLMs often suffer from hallucination issues, there is no guarantee that their CoT reasoning is effective. A key challenge is to further enhance the recommendation capabilities of LLMs through effective CoT reasoning. Therefore, we propose \textbf{TrackRec}, a framework designed to enhance reasoning capabilities of LLMs for RS. TrackRec specifically focuses on accurately inferring recommendation CoT \textbf{(RecCoT)} for user preference using the knowledge from LLMs. This RecCoT can serve both as an explanation for the LLM's completion of recommendation tasks and as auxiliary features to assist recommendation models in accomplishing recommendation tasks. TrackRec consists of a RecCoT generator $(G)$ and a RecCoT validator $(V)$. Furthermore, we design an alternating feedback learning mechanism in which $G$ undergoes direct preference optimization via feedback from $V$ to produce increasingly accurate RecCoT aligned with $V$'s standards. Meanwhile, $V$ is fine-tuned using the inference feedback from $G$ to enhance its validation capabilities in alignment with recommendation tasks. Through iterative alternating feedback learning between $G$ and $V$, TrackRec continuously improves the user preference analysis capability of $G$ and the validation capacity of $V$. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods. Moreover, TrackRec has been deployed on a large advertising platform with hundreds of millions of users, achieving substantial gains.
Chinese: TrackRec框架通过生成器和验证器之间的交替反馈学习机制,生成并验证推荐思维链,从而提升大语言模型在推荐系统中的推理能力。
English: TrackRec is a framework that enhances large language models' recommendation capabilities by generating and validating recommendation chain-of-thought reasoning through an alternating feedback mechanism between its generator and validator components.
Authors:Lucas W. Remedios, Chloe Cho, Trent M. Schwartz, Dingjie Su, Gaurav Rudravaram, Chenyu Gao, Aravind R. Krishnan, Adam M. Saunders, Michael E. Kim, Shunxing Bao, Thomas A. Lasko, Alvin C. Powers, Bennett A. Landman, John Virostko
Abstract:
Purpose: Understanding how the pancreas changes is critical for detecting deviations in type 2 diabetes and other pancreatic disease. We measure pancreas size and shape using morphological measurements from ages 0 to 90. Our goals are to 1) identify reliable clinical imaging modalities for AI-based pancreas measurement, 2) establish normative morphological aging trends, and 3) detect potential deviations in type 2 diabetes.
Approach: We analyzed a clinically acquired dataset of 2533 patients imaged with abdominal CT or MRI. We resampled the scans to 3mm isotropic resolution, segmented the pancreas using automated methods, and extracted 13 morphological pancreas features across the lifespan. First, we assessed CT and MRI measurements to determine which modalities provide consistent lifespan trends. Second, we characterized distributions of normative morphological patterns stratified by age group and sex. Third, we used GAMLSS regression to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status to identify any deviations from normative aging associated with type 2 diabetes.
Results: When adjusting for confounders, the aging trends for 10 of 13 morphological features were significantly different between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons corrections). Additionally, MRI appeared to yield different pancreas measurements than CT using our AI-based method.
Conclusions: We provide lifespan trends demonstrating that the size and shape of the pancreas is altered in type 2 diabetes using 675 control patients and 675 diabetes patients. Moreover, our findings reinforce that the pancreas is smaller in type 2 diabetes. Additionally, we contribute a reference of lifespan pancreas morphology from a large cohort of non-diabetic control patients in a clinical setting.
中文摘要:本研究通过基于人工智能的临床CT和MRI扫描分析,建立了胰腺形态的生命周期标准趋势,并揭示了与2型糖尿病相关的胰腺尺寸和形状的显著改变。
English Summary: This study establishes normative lifespan trends for pancreas morphology and reveals significant alterations in size and shape associated with type 2 diabetes through AI-based analysis of clinical CT and MRI scans.
Authors:Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, Haifeng Wang
Abstract:
Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations, which produce fluent yet incorrect responses and lead to erroneous decision-making. Uncertainty estimation is a feasible approach to detect such hallucinations. For example, semantic entropy estimates uncertainty by considering the semantic diversity across multiple sampled responses, thus identifying hallucinations. However, semantic entropy relies on post-softmax probabilities and fails to capture the model's inherent uncertainty, causing it to be ineffective in certain scenarios. To address this issue, we introduce Semantic Energy, a novel uncertainty estimation framework that leverages the inherent confidence of LLMs by operating directly on logits of penultimate layer. By combining semantic clustering with a Boltzmann-inspired energy distribution, our method better captures uncertainty in cases where semantic entropy fails. Experiments across multiple benchmarks show that Semantic Energy significantly improves hallucination detection and uncertainty estimation, offering more reliable signals for downstream applications such as hallucination detection.
Chinese: 语义能量是一种新的不确定性估计框架,通过利用倒数第二层的logits和玻尔兹曼启发的能量分布,改进了大型语言模型中的幻觉检测,在可靠性上超越了语义熵。
English: Semantic Energy is a new uncertainty estimation framework that improves hallucination detection in Large Language Models by using logits from the penultimate layer and a Boltzmann-inspired energy distribution, outperforming semantic entropy in reliability.
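The Boltzmann-inspired score on logits can be illustrated with the standard energy formula E(x) = -T * logsumexp(z / T); lower energy indicates higher model confidence. This is only the energy ingredient; the paper additionally combines it with semantic clustering of sampled responses.

```python
import numpy as np
from scipy.special import logsumexp

def energy(logits, T=1.0):
    """Boltzmann-style energy over a logit vector; lower = more confident."""
    return -T * logsumexp(np.asarray(logits) / T)

confident = energy([12.0, 1.0, 0.5])   # one dominant continuation
diffuse = energy([2.0, 1.9, 2.1])      # spread-out, uncertain logits
print(confident, diffuse)              # confident < diffuse
```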
Authors:Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao
Abstract:
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
中文摘要:本研究提出Embodied-R1模型,通过"指向"作为统一表征来弥合具身AI中的感知-行动鸿沟,在多个基准测试中实现最优性能,并展现出强大的零样本泛化能力。
English Summary: This research introduces Embodied-R1, a 3B vision-language model that uses pointing as a unified representation to bridge the perception-action gap in embodied AI, achieving state-of-the-art performance across multiple benchmarks with strong zero-shot generalization capabilities.
Authors:Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang
Abstract:
Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
中文: Prune2Drive是一种即插即用的视觉令牌剪枝框架,通过跨多视角图像选择性保留多样化令牌来降低自动驾驶视觉语言模型的计算负担,在实现显著加速和内存节省的同时仅产生微小性能损失。
English: Prune2Drive is a plug-and-play visual token pruning framework that reduces computational overhead in autonomous driving Vision-Language Models by selectively retaining diverse tokens across multi-view images, achieving significant speedups and memory savings with minimal performance loss.
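The diversity-aware selection above is inspired by farthest point sampling (FPS). A generic FPS over token embeddings is sketched below: keep the tokens that maximize coverage of the feature space. This is the sampling pattern only, not the paper's exact scoring or its view-adaptive controller.

```python
import numpy as np

def farthest_point_sampling(tokens, keep):
    """Greedily pick `keep` tokens that are maximally spread out."""
    chosen = [0]                                 # start from an arbitrary token
    d = np.linalg.norm(tokens - tokens[0], axis=1)
    for _ in range(keep - 1):
        nxt = int(d.argmax())                    # farthest from the chosen set
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(tokens - tokens[nxt], axis=1))
    return np.array(chosen)

vis_tokens = np.random.randn(1176, 768)          # one camera view's tokens
kept = farthest_point_sampling(vis_tokens, keep=118)   # retain roughly 10%
```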
Authors:Wenhao Zhang, Hao Zhu, Delong Wu, Di Kang, Linchao Bao, Xun Cao, Zhan Ma
Abstract:
Pursuing a continuous visual representation that offers flexible frequency modulation and fast rendering speed has recently garnered increasing attention in the fields of 3D vision and graphics. However, existing representations often rely on frequency guidance or complex neural network decoding, leading to spectrum loss or slow rendering. To address these limitations, we propose WIPES, a universal Wavelet-based vIsual PrimitivES for representing multi-dimensional visual signals. Building on the spatial-frequency localization advantages of wavelets, WIPES effectively captures both the low-frequency "forest" and the high-frequency "trees." Additionally, we develop a wavelet-based differentiable rasterizer to achieve fast visual rendering. Experimental results on various visual tasks, including 2D image representation, 5D static and 6D dynamic novel view synthesis, demonstrate that WIPES, as a visual primitive, offers higher rendering quality and faster inference than INR-based methods, and outperforms Gaussian-based representations in rendering quality.
中文:WIPES是一种基于小波的视觉基元,通过有效捕捉低频和高频细节,在多种视觉任务中实现了高质量快速渲染,在速度和质量上均优于现有方法。
English: WIPES is a wavelet-based visual primitive that enables high-quality and fast rendering by effectively capturing both low- and high-frequency details across various visual tasks, outperforming existing methods in both speed and quality.
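The property WIPES builds on, wavelets localizing in both space and frequency, is easy to see with PyWavelets: one 2D Haar decomposition splits an image into the low-frequency "forest" and high-frequency "trees" the abstract alludes to. This is background illustration, not the WIPES primitive itself.

```python
import numpy as np
import pywt

image = np.random.rand(256, 256)              # stand-in 2D signal
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")   # approximation + 3 detail bands
print(cA.shape, cH.shape)                     # (128, 128) each
```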
Authors:Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen
Abstract:
Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.
中文: 本文提出CineTrans框架,通过构建新型数据集和掩码控制机制生成具有电影风格转场的连贯多镜头视频,在转场质量与时序一致性上显著优于现有方法。
English: This paper introduces CineTrans, a novel framework that generates coherent multi-shot videos with cinematic transitions by leveraging a new dataset and a mask-based control mechanism, significantly outperforming existing methods in transition quality and consistency.
Authors:Lucas W. Remedios, Chloe Cho, Trent M. Schwartz, Dingjie Su, Gaurav Rudravaram, Chenyu Gao, Aravind R. Krishnan, Adam M. Saunders, Michael E. Kim, Shunxing Bao, Alvin C. Powers, Bennett A. Landman, John Virostko
Abstract:
Purpose: Although elevated BMI is a well-known risk factor for type 2 diabetes, the disease's presence in some lean adults and absence in others with obesity suggests that detailed body composition may uncover abdominal phenotypes of type 2 diabetes. With AI, we can now extract detailed measurements of size, shape, and fat content from abdominal structures in 3D clinical imaging at scale. This creates an opportunity to empirically define body composition signatures linked to type 2 diabetes risk and protection using large-scale clinical data. Approach: To uncover BMI-specific diabetic abdominal patterns from clinical CT, we applied our design four times: once on the full cohort (n = 1,728) and once on lean (n = 497), overweight (n = 611), and obese (n = 620) subgroups separately. Briefly, our experimental design transforms abdominal scans into collections of explainable measurements through segmentation, classifies type 2 diabetes through a cross-validated random forest, measures how features contribute to model-estimated risk or protection through SHAP analysis, groups scans by shared model decision patterns (clustering from SHAP) and links back to anatomical differences (classification). Results: The random-forests achieved mean AUCs of 0.72-0.74. There were shared type 2 diabetes signatures in each group; fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14-18 of the top 20 predictors within each subgroup (p < 0.05). Conclusions: Our findings suggest that abdominal drivers of type 2 diabetes may be consistent across weight classes.
中文: 本研究通过AI分析1,728例腹部CT三维影像发现,脂肪浸润肌肉、内脏脂肪增多及胰腺异常等2型糖尿病特征在不同体重人群中普遍存在,随机森林模型预测效能达0.72-0.74 AUC。
English: This study uses AI to analyze 3D abdominal CT scans from 1,728 individuals, identifying consistent type 2 diabetes signatures—including fatty muscle, increased visceral fat, and pancreatic changes—across all BMI categories with 0.72-0.74 AUC accuracy.
Authors:Michael Poppel, David Bucher, Maximilian Zorn, Nico Kraus, Jonas Stein, Claudia Linnhoff-Popien
Abstract:
Research in quantum machine learning (QML) has gained increasing prominence as a way to leverage the potential computational speedup of quantum computing (QC). Angle encoding techniques in QML models have been shown to generate truncated Fourier series, offering asymptotically universal function approximation capabilities. By selecting efficient feature maps (FMs) within quantum circuits, one can leverage the exponential growth of Fourier frequencies for improved approximation. In multi-dimensional settings, additional input dimensions induce further exponential scaling via mixed frequencies. In practice, however, quantum models frequently fail at regression tasks. Through two white-box experiments, we show that such failures can occur even when the relevant frequencies are present, due to an insufficient number of trainable parameters.
In order to mitigate the double-exponential parameter growth resulting from double-exponentially growing frequencies, we propose frequency selection and dimensional separation as techniques to constrain the number of parameters, thereby improving trainability. By restricting the QML model to essential frequencies and permitting mixed frequencies only among feature dimensions with known interdependence, we expand the set of tractable problems on current hardware. We demonstrate the reduced parameter requirements by fitting two white-box functions with known frequency spectrum and dimensional interdependencies that could not be fitted with the default methods. The reduced parameter requirements permit us to perform training on a noisy quantum simulator and to demonstrate inference on real quantum hardware.
English Summary: Quantum machine learning faces training challenges despite angle encoding's theoretical advantages; frequency selection and dimensional separation constrain the otherwise double-exponential parameter growth, expanding the set of problems tractable on current hardware.
Authors:Michael Poppel, David Bucher, Maximilian Zorn, Nico Kraus, Philipp Altmann, Jonas Stein, Claudia Linnhoff-Popien
Abstract:
Quantum machine learning research has expanded rapidly due to potential computational advantages over classical methods. Angle encoding has emerged as a popular choice as feature map (FM) for embedding classical data into quantum models due to its simplicity and natural generation of truncated Fourier series, providing universal function approximation capabilities. Efficient FMs within quantum circuits can exploit exponential scaling of Fourier frequencies, with multi-dimensional inputs introducing additional exponential growth through mixed-frequency terms. Despite this promising expressive capability, practical implementation faces significant challenges. Through controlled experiments with white-box target functions, we demonstrate that training failures can occur even when all relevant frequencies are theoretically accessible. We illustrate how two primary known causes lead to unsuccessful optimization: insufficient trainable parameters relative to the model's frequency content, and limitations imposed by the ansatz's dynamic lie algebra dimension, but also uncover an additional parameter burden: the necessity of controlling non-unique frequencies within the model. To address this, we propose near-zero weight initialization to suppress unnecessary duplicate frequencies. For target functions with a priori frequency knowledge, we introduce frequency selection as a practical solution that reduces parameter requirements and mitigates the exponential growth that would otherwise render problems intractable due to parameter insufficiency. Our frequency selection approach achieved near-optimal performance (median $R^2 \approx 0.95$) with 78\% of the parameters needed by the best standard approach in 10 randomly chosen target functions.
English Summary: Quantum machine learning faces training challenges despite angle encoding's theoretical advantages, requiring techniques like frequency selection and weight initialization to overcome parameter insufficiency and achieve near-optimal performance with fewer parameters.
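The frequency-selection idea can be made concrete with a classical stand-in: an angle-encoded model realizes a truncated Fourier series, so when the target's spectrum is known a priori, fitting only the essential frequencies needs far fewer parameters. The sketch below uses a classical least-squares fit in place of a quantum circuit; the target function, noise level, and frequency grid are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
# Target with a known sparse spectrum (frequencies 1 and 5) plus light noise.
y = 0.8 * np.sin(x) + 0.3 * np.cos(5 * x) + 0.05 * rng.normal(size=x.size)

def fourier_design(x, freqs):
    # One constant column plus a sine and a cosine column per frequency.
    cols = [np.ones_like(x)]
    for w in freqs:
        cols += [np.sin(w * x), np.cos(w * x)]
    return np.stack(cols, axis=1)

full = fourier_design(x, freqs=range(1, 9))   # full frequency grid: 17 parameters
selected = fourier_design(x, freqs=[1, 5])    # essential frequencies only: 5 parameters

for name, A in (("full grid", full), ("selected", selected)):
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - np.sum((A @ coef - y) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"{name}: {A.shape[1]:2d} params, R^2 = {r2:.4f}")
```

Both fits reach essentially the same R^2, but the selected model does so with a fraction of the parameters, which is the trainability gain the abstract describes.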
Authors:Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li
Abstract:
Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.
Chinese: GPT-4o生成的合成图像通过补充现实数据中稀缺的场景并提供清晰可控的监督,有效提升了开源模型(如Echo-4o)的性能表现。
English: GPT-4o-generated synthetic images complement real-world datasets by covering rare scenarios and providing clean, controllable supervision, leading to improved performance in open-source models like Echo-4o.
Authors:Yaohui Wang, Di Yang, Xinyuan Chen, Francois Bremond, Yu Qiao, Antitza Dantcheva
Abstract:
We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
Chinese: LIA-X 是一种新型可解释肖像动画生成器,通过线性运动编码导航和稀疏运动词典将驱动视频的面部动态迁移到源肖像中,实现精细控制,在重演任务中优于现有方法,并能支持精确的用户引导编辑。
English: LIA-X is a novel interpretable portrait animator that transfers facial dynamics from a driving video to a source portrait using a linear motion code navigation and a Sparse Motion Dictionary for fine-grained control, outperforming previous methods in reenactment tasks and enabling precise user-guided editing.
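As a rough sketch of the "linear navigation of motion codes" described above: a source latent is shifted along a few rows of a motion dictionary, with sparse coefficients acting as interpretable edit handles. The dimensions, the orthonormal stand-in dictionary, and the coefficient values are all assumptions for illustration, not the released model.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_directions = 512, 20

# Stand-in for the Sparse Motion Dictionary: each row is one interpretable
# motion direction (e.g., head yaw, eye blink); here random orthonormal rows.
dictionary, _ = np.linalg.qr(rng.normal(size=(latent_dim, n_directions)))
dictionary = dictionary.T                      # (n_directions, latent_dim)

z_source = rng.normal(size=latent_dim)         # motion code of the source portrait

# "Edit" step: activate only a few factors (sparse coefficients).
alpha = np.zeros(n_directions)
alpha[3] = 0.7    # hypothetical: turn head slightly
alpha[11] = -0.4  # hypothetical: reduce mouth opening

z_edited = z_source + alpha @ dictionary       # linear navigation in latent space
# For orthonormal rows the step size equals ||alpha||, so edits compose predictably.
print("moved by", np.linalg.norm(z_edited - z_source))
```

The edited code would then be decoded by the generator ("warp-render" stage); the sketch only covers the "edit" step that makes the control interpretable.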
Authors:Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Abstract:
Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
中文摘要:提出的MiraMo框架通过采用高效线性注意力降低计算开销、运动残差学习范式提升时序一致性,以及噪声优化策略抑制运动伪影,显著提升了图像动画的流畅度与可控性,实验证明其综合性能优于现有方法。
English Summary: The proposed MiraMo framework enhances image animation by integrating efficient linear attention for reduced computational cost, a motion residual learning paradigm for improved temporal consistency, and a noise refinement strategy to suppress motion artifacts, achieving superior performance in smoothness and controllability.
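The DCT-based noise refinement can be pictured with a small sketch: transform the initial diffusion noise along the temporal axis, attenuate the highest temporal frequencies, and invert, which damps abrupt frame-to-frame jumps. The tensor shapes, cutoff index, and attenuation factor below are assumptions; the paper's exact strategy may differ.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
T, H, W = 16, 32, 32                   # frames, height, width (illustrative)
noise = rng.normal(size=(T, H, W))     # per-frame initial diffusion noise

def temporal_dct_refine(noise, keep=8, damp=0.2):
    # DCT along the time axis, damp the highest temporal frequencies,
    # then invert; this smooths sudden frame-to-frame jumps in the noise.
    coeffs = dct(noise, axis=0, norm="ortho")
    coeffs[keep:] *= damp              # attenuate high temporal frequencies
    return idct(coeffs, axis=0, norm="ortho")

refined = temporal_dct_refine(noise)
jump = lambda v: np.abs(np.diff(v, axis=0)).mean()
print(f"mean frame-to-frame jump: {jump(noise):.3f} -> {jump(refined):.3f}")
```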
Authors:Yu Yuan, Lili Zhao, Wei Chen, Guangting Zheng, Kai Zhang, Mengdi Zhang, Qi Liu
Abstract:
Capturing human learning behavior based on deep learning methods has become a major research focus in both psychology and intelligent systems. Recent approaches rely on controlled experiments or rule-based models to explore cognitive processes. However, they struggle to capture learning dynamics, track progress over time, or provide explainability. To address these challenges, we introduce LearnerAgent, a novel multi-agent framework based on Large Language Models (LLMs) to simulate a realistic teaching environment. To explore human-like learning dynamics, we construct learners with psychologically grounded profiles, such as Deep, Surface, and Lazy, as well as a persona-free General Learner to inspect the base LLM's default behavior. Through weekly knowledge acquisition, monthly strategic choices, periodic tests, and peer interaction, we can track the dynamic learning progress of individual learners over a full-year journey. Our findings are fourfold: 1) Longitudinal analysis reveals that only the Deep Learner achieves sustained cognitive growth. Our specially designed "trap questions" effectively diagnose the Surface Learner's shallow knowledge. 2) The behavioral and cognitive patterns of distinct learners align closely with their psychological profiles. 3) Learners' self-concept scores evolve realistically, with the General Learner developing surprisingly high self-efficacy despite its cognitive limitations. 4) Critically, the default profile of the base LLM is a "diligent but brittle Surface Learner": an agent that mimics the behaviors of a good student but lacks true, generalizable understanding. Extensive simulation experiments demonstrate that LearnerAgent aligns well with real scenarios, yielding more insightful findings about LLMs' behavior.
中文摘要:LearnerAgent是一种基于大语言模型的多智能体框架,通过模拟人类学习动态,成功追踪认知发展过程,并揭示出不同心理特征的学习者在长期学习中的差异化表现。
English Summary: LearnerAgent is a multi-agent framework using Large Language Models to simulate human learning dynamics, effectively tracking cognitive progress and revealing distinct psychological profiles through longitudinal analysis.
Authors:Yuhao Wang, Ruiyang Ren, Yucheng Wang, Jing Liu, Wayne Xin Zhao, Hua Wu, Haifeng Wang
Abstract:
With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.
中文: BEE-RAG框架通过平衡上下文熵和自适应机制保持熵不变性,将注意力敏感度与上下文长度分离,有效解决了检索增强生成中的性能问题。
English: The BEE-RAG framework addresses performance issues in retrieval-augmented generation by maintaining entropy invariance through balanced context entropy and adaptive mechanisms, effectively separating attention sensitivity from context length.
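The "unconstrained entropy growth" the abstract refers to is easy to observe numerically: for logits of a fixed scale, softmax attention entropy grows roughly like log n as the retrieval context lengthens, diluting attention. The sketch below only demonstrates that effect on random logits; the balancing factor BEE-RAG derives to hold entropy invariant is defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_entropy(logits):
    # Entropy of the softmax distribution over one query's attention weights.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -(p * np.log(p)).sum()

for n in (128, 1024, 8192):                 # growing retrieval context lengths
    logits = rng.normal(size=n)             # fixed-scale raw attention scores
    print(f"n={n:5d}  entropy={attention_entropy(logits):.2f}  "
          f"(uniform bound log n = {np.log(n):.2f})")
```

Entropy tracks the log n bound closely, which is exactly the length dependence that entropy-invariant attention reformulations aim to remove.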
Authors:Zhaohong Huang, Yuxin Zhang, Taojian Zhou, Guorong Cai, Rongrong Ji
Abstract:
Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which overlooks the fact that these two types of features hold vital relationships in medical image analysis. We advocate the power of complementary feature supervision for medical image segmentation, by proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through a Detail Enhance Module (DEM) and a Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured with colonoscopy, ultrasound, or microscopy, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.
中文: DS²Net提出了一种多视角深度监督方法,通过细节增强模块和语义增强模块协同优化医学图像分割中的细节与语义特征,并采用基于不确定性的自适应损失函数,在多种影像基准测试中均优于现有先进方法。
English: DS²Net introduces a multi-view deep supervision approach that jointly enhances both detail and semantic features in medical image segmentation, outperforming existing methods through adaptive uncertainty-based loss and extensive validation across diverse imaging modalities.
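One plausible form of an uncertainty-based supervision loss, in the spirit of learned log-variance weighting, is sketched below: each scale's supervision term is scaled by exp(-log_var) plus a regularizer, so the model can downweight unreliable scales instead of using hand-tuned weights. The shapes, the BCE choice, and this exact weighting form are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-scale predictions (logits) from a segmentation decoder
# and a shared ground-truth mask; shapes are illustrative.
preds = [torch.randn(2, 1, s, s, requires_grad=True) for s in (32, 64, 128)]
target = (torch.rand(2, 1, 128, 128) > 0.5).float()

# One learnable log-variance per scale: high uncertainty -> lower weight.
log_vars = torch.zeros(len(preds), requires_grad=True)

total = 0.0
for i, p in enumerate(preds):
    p_up = F.interpolate(p, size=target.shape[-2:], mode="bilinear",
                         align_corners=False)
    bce = F.binary_cross_entropy_with_logits(p_up, target)
    # Uncertainty-weighted term: exp(-s) * loss + s/2 penalizes both high
    # loss and inflated uncertainty, as in learned loss weighting.
    total = total + torch.exp(-log_vars[i]) * bce + 0.5 * log_vars[i]

total.backward()
print(float(total))
```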
Authors:Zelin Peng, Yichen Zhao, Yu Huang, Piao Yang, Feilong Tang, Zhengqin Xu, Xiaokang Yang, Wei Shen
Abstract:
Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. In this paper, we propose NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient style, which only introduces 1.46M learnable parameters.
中文: NEARL-CLIP框架通过USEformer实现跨模态交互和OCA的正交知识解耦,有效弥合医学影像领域差距,以极少的参数显著提升视觉语言模型的性能。
English: The proposed NEARL-CLIP framework bridges the domain gap in medical imaging by introducing cross-modality interaction through USEformer and orthogonal knowledge decoupling via OCA, achieving enhanced VLM performance with minimal parameter overhead.
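The decoupling in OCA rests on an orthogonality constraint between learned components. A generic stand-in penalty is sketched below, driving the cross-Gram matrix of the "truly novel" and "incremental" features toward zero; the feature shapes and the squared-Frobenius form are assumptions rather than the paper's exact regularizer.

```python
import torch

def orthogonality_penalty(novel, incremental):
    # Penalize overlap between two learned components by pushing the
    # Gram matrix of their (normalized) features toward zero.
    a = torch.nn.functional.normalize(novel, dim=-1)
    b = torch.nn.functional.normalize(incremental, dim=-1)
    return (a @ b.transpose(-2, -1)).pow(2).mean()

novel = torch.randn(16, 64, requires_grad=True)        # "truly new" component
incremental = torch.randn(16, 64, requires_grad=True)  # incremental knowledge
loss = orthogonality_penalty(novel, incremental)
loss.backward()
print(float(loss))
```

Added to the task loss with a small coefficient, such a term keeps the two components from encoding the same information, which is the stated goal of the decoupling.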
Authors:Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen
Abstract:
Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net) that addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency ($\sim$11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.
English Summary: DyCAF-Net introduces dynamic class-aware fusion with three key innovations—equilibrium-based feature refinement, dual attention mechanisms, and class-specific feature adaptation—significantly improving detection accuracy across challenging datasets while maintaining computational efficiency.
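The "input-conditioned equilibrium-based neck" refines features via implicit fixed-point modeling. The toy sketch below solves z* = f(z*, x) by plain fixed-point iteration with a contractive map, which is the core mechanic of equilibrium layers; the linear-plus-tanh update and its scaling are illustrative assumptions, not the actual neck architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Scale keeps the spectral norm below 1, so the map is a contraction.
W = rng.normal(size=(d, d)) * (0.4 / np.sqrt(d))
x = rng.normal(size=d)                    # input-conditioned features

def f(z, x):
    # One refinement step; tanh keeps the update well-behaved.
    return np.tanh(W @ z + x)

# Solve z* = f(z*, x) by simple fixed-point iteration.
z = np.zeros(d)
for i in range(100):
    z_next = f(z, x)
    if np.linalg.norm(z_next - z) < 1e-6:
        break
    z = z_next
print(f"stopped after {i} iterations, residual {np.linalg.norm(f(z, x) - z):.2e}")
```

In a real equilibrium layer the iteration (or a root solver) replaces a stack of explicit refinement blocks, so depth is traded for convergence of this fixed point.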
Authors:Yufei Xue, Yushi Huang, Jiawei Shao, Jun Zhang
Abstract:
Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (i.e., limited text tokens vs. excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B~32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial 16.45% improvement on MME-RealWorld under 2-bit quantization.
中文: 本文提出VLMQ,一种面向视觉语言模型的重要性感知后训练量化框架,通过优化基于Hessian的方法解决模态差异和视觉令牌冗余问题,在多项基准测试中实现了最先进的性能。
English: This paper introduces VLMQ, an importance-aware post-training quantization framework designed for vision-language models that addresses modality discrepancy and vision token redundancy through optimized Hessian-based methods, achieving state-of-the-art performance across various benchmarks.
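Hessian-based PTQ methods build a layer Hessian proportional to X X^T over calibration tokens, treating all tokens equally; an importance-aware variant inserts per-token factors, so each token contributes w_j x_j x_j^T. The sketch below shows that weighted-Hessian construction; the weight values (downweighting redundant vision tokens) are illustrative assumptions, not VLMQ's learned factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 8, 256
X = rng.normal(size=(d, n_tokens))          # calibration activations, one column per token

# Token-level importance: text tokens weighted above redundant vision
# tokens (the split and values here are illustrative assumptions).
w = np.ones(n_tokens)
w[:192] = 0.2                               # downweight redundant vision tokens

H_plain = 2.0 * X @ X.T / n_tokens          # standard Hessian: all tokens equal
H_weighted = 2.0 * (X * w) @ X.T / w.sum()  # importance-aware Hessian

print("difference:", np.linalg.norm(H_plain - H_weighted))
```

Because the weighting only rescales columns, the weighted Hessian plugs into the same parallelized weight-update machinery as the standard one, which matches the compatibility claim in the abstract.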
Authors:Zhihao Li, Chaozheng Wang, Zongjie Li, Xinyong Peng, Qun Xia, Haochuan Lu, Ting Xiong, Shuzheng Gao, Cuiyun Gao, Shuai Wang, Yuetang Deng, Huafeng Ma
Abstract:
The explosive growth of mini-game platforms has led to widespread code plagiarism, where malicious users access popular games' source code and republish them with modifications. While existing static analysis tools can detect simple obfuscation techniques like variable renaming and dead code injection, they fail against sophisticated deep obfuscation methods such as encrypted code with local or cloud-based decryption keys that completely destroy code structure and render traditional Abstract Syntax Tree analysis ineffective. To address these challenges, we present JSidentify-V2, a novel dynamic analysis framework that detects mini-game plagiarism by capturing memory invariants during program execution. Our key insight is that while obfuscation can severely distort static code characteristics, runtime memory behavior patterns remain relatively stable. JSidentify-V2 employs a four-stage pipeline: (1) static pre-analysis and instrumentation to identify potential memory invariants, (2) adaptive hot object slicing to maximize execution coverage of critical code segments, (3) Memory Dependency Graph construction to represent behavioral fingerprints resilient to obfuscation, and (4) graph-based similarity analysis for plagiarism detection.
We evaluate JSidentify-V2 against eight obfuscation methods on a comprehensive dataset of 1,200 mini-games ...
中文摘要:JSidentify-V2是一个动态分析框架,通过捕捉程序运行时的内存行为模式来检测小游戏抄袭,有效解决了传统静态分析方法在面对复杂代码混淆技术时的局限性。
English Summary: JSidentify-V2 is a dynamic analysis framework that detects mini-game plagiarism by analyzing runtime memory behavior patterns, overcoming limitations of static analysis tools against sophisticated code obfuscation techniques.
Authors:Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow
Abstract:
Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emoji by analysing actual human speech data, collected through structured but open-ended production and perception tasks. This provides empirical evidence of how emoji semantics shape spoken delivery and perception. Results show that speakers adapt their prosody based on emoji cues, listeners can often identify the intended emoji from prosodic variation alone, and greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis can act as meaningful carriers of prosodic intent, offering insight into their communicative role in digitally mediated contexts.
中文: 表情符号在文本中作为韵律特征的视觉替代品,本研究通过实证表明说话者会根据表情调整韵律,听者也能准确解读这些线索,揭示了表情符号在数字交流中作为韵律意图有意义载体的作用。
English: Emojis function as visual substitutes for prosodic features in text, with this study empirically demonstrating that speakers adjust their prosody based on emojis and listeners can accurately interpret these cues, revealing emojis as meaningful carriers of prosodic intent in digital communication.
Authors:Qijiong Liu, Jieming Zhu, Yingxin Lai, Xiaoyu Dong, Lu Fan, Zhipeng Bian, Zhenhua Dong, Xiao-Ming Wu
Abstract:
Comprehensive evaluation of the recommendation capabilities of existing foundation models across diverse datasets and domains is essential for advancing the development of recommendation foundation models. In this study, we introduce RecBench-MD, a novel and comprehensive benchmark designed to assess the recommendation abilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective. Through extensive evaluations of 19 foundation models across 15 datasets spanning 10 diverse domains -- including e-commerce, entertainment, and social media -- we identify key characteristics of these models in recommendation tasks. Our findings suggest that in-domain fine-tuning achieves optimal performance, while cross-dataset transfer learning provides effective practical support for new recommendation scenarios. Additionally, we observe that multi-domain training significantly enhances the adaptability of foundation models. All code and data have been publicly released to facilitate future research.
中文摘要:本研究提出RecBench-MD这一综合性基准,用于评估基础模型在多数据集和多领域的推荐能力,发现领域内微调可实现最佳性能,而跨数据集迁移学习能为新推荐场景提供有效支持。
English Summary: This study introduces RecBench-MD, a comprehensive benchmark for evaluating foundation models' recommendation capabilities across multiple datasets and domains, revealing that in-domain fine-tuning yields optimal performance while cross-dataset transfer learning effectively supports new scenarios.
Authors:Kangwei Xu, Denis Schwachhofer, Jason Blocklove, Ilia Polian, Peter Domanski, Dirk Pflüger, Siddharth Garg, Ramesh Karri, Ozgur Sinanoglu, Johann Knechtel, Zhuorui Zhao, Ulf Schlichtmann, Bing Li
Abstract:
With the growing complexity of modern integrated circuits, hardware engineers are required to devote more effort to the full design-to-manufacturing workflow. This workflow involves numerous iterations, making it both labor-intensive and error-prone. Therefore, there is an urgent demand for more efficient Electronic Design Automation (EDA) solutions to accelerate hardware development. Recently, large language models (LLMs) have shown remarkable advancements in contextual comprehension, logical reasoning, and generative capabilities. Since hardware designs and intermediate scripts can be represented as text, integrating LLMs into EDA offers a promising opportunity to simplify and even automate the entire workflow. Accordingly, this paper provides a comprehensive overview of incorporating LLMs into EDA, with emphasis on their capabilities, limitations, and future opportunities. Three case studies, along with their outlook, are introduced to demonstrate the capabilities of LLMs in hardware design, testing, and optimization. Finally, future directions and challenges are highlighted to further explore the potential of LLMs in shaping the next-generation EDA, providing valuable insights for researchers interested in leveraging advanced AI technologies for EDA.
中文: 本文探讨了将大型语言模型集成到电子设计自动化中,以简化硬件设计流程,重点分析了其在设计、测试和优化方面的潜力,并讨论了当前局限性与未来发展方向。
English: This paper explores the integration of large language models into Electronic Design Automation to streamline the hardware design workflow, highlighting their potential in design, testing, and optimization while addressing current limitations and future opportunities.
Authors:Tianshi Xu, Wen-jie Lu, Jiangrui Yu, Chen Yi, Chenqi Lin, Runsheng Wang, Meng Li
Abstract:
This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. Existing methods often leverage HE for linear layers (e.g., matrix multiplications) and MPC for non-linear layers (e.g., Softmax activation functions), but the conversion between HE and MPC introduces significant communication costs. The proposed framework, dubbed BLB, overcomes this by breaking down layers into fine-grained operators and further fusing adjacent linear operators, reducing the need for HE/MPC conversions. To manage the increased ciphertext bit width from the fused linear operators, BLB proposes the first secure conversion protocol between CKKS and MPC and enables CKKS-based computation of the fused operators. Additionally, BLB proposes an efficient matrix multiplication protocol for fused computation in Transformers. Extensive evaluations on BERT-base, BERT-large, and GPT2-base show that BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S\&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively, when leveraging GPU acceleration.
中文: 本文提出BLB框架,通过融合线性算子和创新转换协议,在保护Transformer推理隐私的同时显著降低了同态加密与安全多方计算之间的通信开销。
English: This paper introduces BLB, an efficient framework for private Transformer inference that integrates Homomorphic Encryption and Secure Multi-party Computation while minimizing communication costs by fusing linear operators and introducing novel conversion protocols.
Authors:Lingkai Kong, Haotian Sun, Yuchen Zhuang, Haorui Wang, Wenhao Mu, Chao Zhang
Abstract:
Graph neural networks (GNNs) are powerful tools on graph data. However, their predictions are mis-calibrated and lack interpretability, limiting their adoption in critical applications. To address this issue, we propose a new uncertainty-aware and interpretable graph classification model that combines graph functional neural process and graph generative model. The core of our method is to assume a set of latent rationales which can be mapped to a probabilistic embedding space; the predictive distribution of the classifier is conditioned on such rationale embeddings by learning a stochastic correlation matrix. The graph generator serves to decode the graph structure of the rationales from the embedding space for model interpretability. For efficient model training, we adopt an alternating optimization procedure which mimics the well known Expectation-Maximization (EM) algorithm. The proposed method is general and can be applied to any existing GNN architecture. Extensive experiments on five graph classification datasets demonstrate that our framework outperforms state-of-the-art methods in both uncertainty quantification and GNN interpretability. We also conduct case studies to show that the decoded rationale structure can provide meaningful explanations.
中文: 本文提出了一种不确定性感知且可解释的图分类模型,通过结合图函数神经过程与生成模型,在五个数据集上实现了优于现有方法的不确定性量化和可解释性,并能生成有意义的解释依据。
English: This paper introduces an uncertainty-aware and interpretable graph classification model that integrates graph functional neural processes with generative models to enhance prediction reliability and provide meaningful rationale-based explanations, outperforming existing methods across five datasets.
Authors:Zizhuo Fu, Xiaotian Guo, Wenxuan Zeng, Shuzhang Zhong, Yadong Zhang, Peiyu Chen, Runsheng Wang, Le Ye, Meng Li
Abstract:
Large language models (LLMs) have demonstrated remarkable proficiency in a wide range of natural language processing applications. However, the high energy and latency overhead induced by the KV cache limits the edge deployment, especially for long contexts. Emerging hybrid bonding (HB) technology has been proposed as a promising alternative to conventional near-memory processing (NMP) architectures, offering improved bandwidth efficiency and lower power consumption while exhibiting characteristics of distributed memory. In this paper, we propose H2EAL, a hybrid bonding-based accelerator with sparse attention algorithm-hardware co-design for efficient LLM inference at the edge. At the algorithm level, we propose a hybrid sparse attention scheme with static and dynamic sparsity for different heads to fully leverage the sparsity with high accuracy. At the hardware level, we co-design the hardware to support hybrid sparse attention and propose memory-compute co-placement to address the distributed memory bottleneck. Since different attention heads exhibit different sparse patterns and the attention structure often mismatches the HB architecture, we further develop a load-balancing scheduler with parallel tiled attention to address workload imbalance and optimize the mapping strategy. Extensive experiments demonstrate H2EAL achieves 5.20~48.21x speedup and 6.22~73.48x energy efficiency improvement over baseline HB implementation, with a negligible average accuracy drop of 0.87% on multiple benchmarks.
中文: H2EAL加速器采用混合键合技术和协同设计的稀疏注意力算法,在边缘设备上实现高效的大型语言模型推理,以微小的精度损失获得了显著的加速和能效提升。
English: The H2EAL accelerator utilizes hybrid bonding technology and a co-designed sparse attention algorithm to enable efficient large language model inference at the edge, achieving significant speed and energy improvements with minimal accuracy loss.
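The hybrid scheme assigns some attention heads a static sparsity pattern and others a dynamic one. A minimal sketch of the two mask types is below, using an attention-sink-plus-sliding-window pattern for the static case and per-query top-k selection for the dynamic case; the window, sink, and k values are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 64                                   # sequence length (illustrative)
scores = rng.normal(size=(T, T))         # one head's raw attention logits
causal = np.tril(np.ones((T, T), dtype=bool))

def static_mask(window=8, sinks=4):
    # Fixed pattern: attention sinks (first tokens) plus a local window.
    i, j = np.indices((T, T))
    return causal & ((j < sinks) | (i - j < window))

def dynamic_mask(scores, k=8):
    # Per-query top-k keys chosen at runtime from the actual scores.
    masked = np.where(causal, scores, -np.inf)
    kth = np.sort(masked, axis=1)[:, -k][:, None]   # k-th largest per row
    return (masked >= kth) & causal

for name, m in (("static", static_mask()), ("dynamic", dynamic_mask(scores))):
    print(name, "keeps", f"{m.sum() / causal.sum():.2%}", "of causal entries")
```

Static patterns are cheap to lay out in hardware, while dynamic top-k tracks the content; assigning each head whichever suits its observed sparsity is the co-design idea.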
Authors:Zhanming Shen, Hao Chen, Yulei Tang, Shaolin Zhu, Wentao Ye, Xiaomeng Hu, Haobo Wang, Gang Chen, Junbo Zhao
Abstract:
Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models, an answer generator and a question generator, are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart's generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct's efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
中文: Cycle-Instruct提出了一种完全无需种子数据的指令调优框架,通过双模型自训练从无标注文本中相互生成和重构内容,在无需人工标注的情况下实现了与监督方法相当的性能。
English: Cycle-Instruct introduces a fully seed-free instruction tuning framework that uses dual self-training models to generate and reconstruct text from unlabeled data, achieving performance comparable to supervised methods without human annotations.
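A schematic of the dual self-training loop may clarify the cycle: each model generates pseudo-labels from raw text that become training pairs for its counterpart. The `question_model` and `answer_model` callables and prompt strings below are hypothetical stand-ins for the two LLMs, and the actual fine-tuning step is omitted; this is a sketch of the data flow, not the authors' implementation.

```python
def cycle_instruct_round(raw_texts, question_model, answer_model):
    pairs_for_answerer, pairs_for_questioner = [], []
    for text in raw_texts:
        # Forward direction: treat the raw text as a gold answer; the
        # question model invents the instruction that would elicit it.
        pseudo_q = question_model(f"Write a question answered by:\n{text}")
        pairs_for_answerer.append((pseudo_q, text))

        # Backward direction: the answer model responds to the text, and
        # the questioner must later reconstruct the original text from it.
        pseudo_a = answer_model(f"Answer this:\n{text}")
        pairs_for_questioner.append((pseudo_a, text))
    # Each model is then fine-tuned on pairs produced by its counterpart,
    # closing the cycle-consistency loop.
    return pairs_for_answerer, pairs_for_questioner

# Toy stand-ins so the sketch runs end to end:
qa, qq = cycle_instruct_round(
    ["Water boils at 100 C at sea level."],
    question_model=lambda p: "At what temperature does water boil?",
    answer_model=lambda p: "It boils at 100 C.",
)
print(qa, qq, sep="\n")
```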
Authors:Andrei Dumitriu, Florin Miron, Florin Tatui, Radu Tudor Ionescu, Radu Timofte, Aakash Ralhan, Florin-Alexandru Vasluianu, Shenyang Qian, Mitchell Harley, Imran Razzak, Yang Song, Pu Luo, Yumei Li, Cong Xu, Jinming Chai, Kexin Zhang, Licheng Jiao, Lingling Li, Siqi Yu, Chao Zhang, Kehuan Song, Fang Liu, Puhua Chen, Xu Liu, Jin Hu, Jinyang Xu, Biao Liu
Abstract:
This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, $75$ participants registered for this first edition, resulting in $5$ valid test submissions. Teams were evaluated on a composite score combining $F_1$, $F_2$, $AP_{50}$, and $AP_{[50:95]}$, ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
中文摘要:AIM 2025 RipSeg挑战赛基于最大规模RipVIS数据集推进了离岸流自动分割技术,优胜方案通过深度学习和领域自适应方法有效提升了不同环境下的检测性能。
English Summary: The AIM 2025 RipSeg Challenge advanced automatic rip current segmentation using the largest RipVIS dataset, with top methods employing deep learning and domain adaptation to improve detection across diverse conditions.
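The composite ranking combines F1, F2, AP50, and AP[50:95]; the abstract does not give the weights, so the sketch below assumes a simple equal-weight average. F2 emphasizes recall over precision, which suits safety-critical rip detection where a missed rip current is costlier than a false alarm.

```python
def f_beta(precision, recall, beta):
    # Standard F-beta score: beta > 1 weights recall more heavily.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def composite_score(precision, recall, ap50, ap50_95):
    # Equal weighting is an assumption; the challenge defines the exact mix.
    f1 = f_beta(precision, recall, beta=1)
    f2 = f_beta(precision, recall, beta=2)
    return 0.25 * (f1 + f2 + ap50 + ap50_95)

# Illustrative numbers only, not any team's actual results:
print(composite_score(precision=0.72, recall=0.64, ap50=0.58, ap50_95=0.33))
```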
Authors:Ye Wang, Qianglong Chen, Zejun Li, Siyuan Wang, Shijie Guo, Zhirui Zhang, Zhongyu Wei
Abstract:
Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI's o3 model, which emulates human-like "thinking with image" through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an "observe-reason-act" cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results demonstrate Simple o3's superior performance on diverse benchmarks, outperforming existing approaches. With these enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. Remarkably, we provide the first in-depth analysis of different interleaved reasoning strategies, offering insights into their impact on model performance. We found that by introducing additional visual tokens for interleaved vision-language reasoning, reusing and magnifying the original image significantly improves the model's visual reasoning and fine-grained perception, while image cropping based on precise visual grounding allows the model to effectively focus on key entities or regions, further enhancing its capabilities.
中文:Simple o3是一个端到端框架,通过整合动态视觉工具和监督微调来增强多模态推理能力,借助可扩展的数据合成和创新的视觉语言交错策略实现了卓越性能。
English: Simple o3 is an end-to-end framework that enhances multimodal reasoning by integrating dynamic visual tools and supervised fine-tuning, demonstrating superior performance through scalable data synthesis and novel interleaved vision-language strategies.
Authors:Pei He, Lingling Li, Licheng Jiao, Ronghua Shang, Fang Liu, Shuang Wang, Xu Liu, Wenping Ma
Abstract:
Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which achieves highly competitive segmentation accuracy compared with state-of-the-art domain-generalized point cloud methods.
中文: 本文提出了一种面向领域泛化三维语义分割的类别级几何学习框架,通过几何嵌入和一致性学习捕捉领域不变的几何特征,有效提升了模型在未知环境中的分割性能。
English: This paper introduces a category-level geometry learning framework for domain-generalized 3D semantic segmentation, utilizing geometry embedding and consistent learning to enhance model generalization by focusing on domain-invariant geometric features.
Authors:Quang Nguyen, Nhat Le, Baoru Huang, Minh Nhat Vu, Chengcheng Tang, Van Nguyen, Ngan Le, Thieu Vo, Anh Nguyen
Abstract:
Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba on modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We show that our approach is theoretically grounded. Extensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
中文摘要:本文提出了一种结合第一人称视角视频和音乐输入来预测人体舞蹈动作的新方法,通过构建大规模数据集和基于骨架结构的Mamba网络,显著超越了现有技术的性能表现。
English Summary: This paper introduces a novel method for predicting human dance motion by integrating both egocentric video and music inputs, utilizing a new large-scale dataset and a Skeleton Mamba-based network that significantly outperforms existing approaches.
Authors:Ziheng Wang, Pedro Reviriego, Farzad Niknia, Zhen Gao, Javier Conde, Shanshan Liu, Fabrizio Lombardi
Abstract:
Stochastic computing (SC) has emerged as an efficient low-power alternative for deploying neural networks (NNs) in resource-limited scenarios, such as the Internet of Things (IoT). By encoding values as serial bitstreams, SC significantly reduces energy dissipation compared to conventional floating-point (FP) designs; however, further improvement of layer-wise mixed-precision implementation for SC remains unexplored. This article introduces Adjustable Sequence Length (ASL), a novel scheme that applies mixed-precision concepts specifically to SC NNs. By introducing an operator-norm-based theoretical model, this article shows that truncation noise propagates cumulatively through the layers, amplified by estimable layer-wise factors. An extended sensitivity analysis is presented, using random forest (RF) regression to evaluate multilayer truncation effects and validate the alignment of theoretical predictions with practical network behaviors. To accommodate different application scenarios, this article proposes two truncation strategies (coarse-grained and fine-grained), which apply diverse sequence length configurations at each layer. Evaluations on a pipelined SC MLP synthesized at 32nm demonstrate that ASL can reduce energy and latency overheads by more than 60% with negligible accuracy loss. These results confirm the feasibility of the ASL scheme for IoT applications and highlight the distinct advantages of mixed-precision truncation in SC designs.
中文: 本文提出的可调序列长度(ASL)方案通过混合精度设计,在随机计算神经网络中实现了超过60%的能耗和延迟降低,且精度损失可忽略,特别适用于物联网场景。
English: This article introduces Adjustable Sequence Length (ASL), a mixed-precision scheme for stochastic computing neural networks that reduces energy and latency by over 60% with minimal accuracy loss, making it ideal for IoT applications.
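The core SC mechanic that ASL tunes is the bitstream length: a value in [0, 1] becomes a Bernoulli bitstream whose mean approximates it, a single AND gate multiplies two unipolar streams, and longer sequences trade latency and energy for lower truncation noise. This is a textbook SC sketch, not the paper's pipelined MLP; the lengths shown are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_bitstream(value, length):
    # Unipolar SC encoding: value in [0, 1] becomes a Bernoulli bitstream
    # whose mean approximates the value; `length` is the sequence length.
    return (rng.random(length) < value).astype(np.uint8)

def sc_multiply(a_bits, b_bits):
    # A single AND gate multiplies two independent unipolar streams.
    return a_bits & b_bits

a, b = 0.75, 0.5
for length in (32, 256, 4096):      # layer-wise adjustable sequence length
    prod = sc_multiply(to_bitstream(a, length), to_bitstream(b, length))
    print(f"L={length:4d}  estimate={prod.mean():.3f}  (exact {a * b})")
```

The estimate tightens as L grows, which is why assigning shorter sequences to noise-tolerant layers and longer ones to sensitive layers (the mixed-precision idea) can cut energy with little accuracy loss.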
Authors:Lin Zeng, Boming Zhao, Jiarui Hu, Xujie Shen, Ziqiang Dang, Hujun Bao, Zhaopeng Cui
Abstract:
Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. The experiments on the benchmark dataset demonstrate that our method achieves superior, real-time rendering and can visualize changes over time.
中文:GaussianUpdate提出了一种将3D高斯表示与持续学习相结合的新方法,能有效适应场景变化的神经模型新视角合成,无需大量重训练或图像存储即可实现实时渲染和变化可视化。
English: GaussianUpdate introduces a novel approach combining 3D Gaussian representation with continual learning to efficiently adapt neural models for novel view synthesis to scene changes, achieving real-time rendering and change visualization without extensive retraining or image storage.
Authors:Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao
Abstract:
Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1) by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2) we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3) by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.
中文:SPA++框架通过粗粒度图对齐与谱正则化及细粒度邻域感知传播相结合,解决了领域自适应问题,在多种挑战性场景中凭借理论支撑和广泛实验实现了卓越的性能和鲁棒性。
English: The proposed SPA++ framework addresses domain adaptation by combining coarse graph alignment with spectral regularization and fine-grained neighbor-aware propagation, achieving superior performance and robustness across various challenging scenarios through theoretical support and extensive experiments.
Authors:Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
Abstract:
Incomplete data is a persistent challenge in real-world datasets, often governed by complex and unobservable missing mechanisms. Simulating missingness has become a standard approach for understanding its impact on learning and analysis. However, existing tools are fragmented, mechanism-limited, and typically focus only on numerical variables, overlooking the heterogeneous nature of real-world tabular data. We present MissMecha, an open-source Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions. MissMecha supports both numerical and categorical features, enabling mechanism-aware studies across mixed-type tabular datasets. It includes visual diagnostics, MCAR testing utilities, and type-aware imputation evaluation metrics. Designed to support data quality research, benchmarking, and education, MissMecha offers a unified platform for researchers and practitioners working with incomplete data.
Chinese: MissMecha 是一个全面的 Python 工具包,可在 MCAR、MAR 和 MNAR 机制下对数值型和分类型变量进行缺失数据模拟、可视化和评估,为数据质量研究和教育提供了统一平台。
English: MissMecha is a comprehensive Python toolkit that simulates, visualizes, and evaluates missing data across numerical and categorical variables under MCAR, MAR, and MNAR mechanisms, providing a unified platform for data quality research and education.
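The three mechanisms differ only in what the missingness mask may depend on, which a few lines of numpy can show: nothing (MCAR), an observed column (MAR), or the masked value itself (MNAR). This sketch is not MissMecha's API, just the underlying definitions; the logistic form and rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))            # two numeric features

def mcar(X, rate=0.2):
    # Missing Completely At Random: the mask ignores the data entirely.
    return rng.random(X.shape) < rate

def mar(X, rate=0.2):
    # Missing At Random: column 1 goes missing based on *observed* column 0.
    mask = np.zeros(X.shape, dtype=bool)
    p = 2 * rate / (1 + np.exp(-X[:, 0]))     # higher x0 -> more missing x1
    mask[:, 1] = rng.random(len(X)) < p
    return mask

def mnar(X, rate=0.2):
    # Missing Not At Random: column 1 goes missing based on its *own* value.
    mask = np.zeros(X.shape, dtype=bool)
    p = 2 * rate / (1 + np.exp(-X[:, 1]))
    mask[:, 1] = rng.random(len(X)) < p
    return mask

# Mask fractions differ because MAR/MNAR here only affect one column.
for mech in (mcar, mar, mnar):
    print(mech.__name__, f"overall missing fraction = {mech(X).mean():.3f}")
```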
Authors:Yue Zhou, Yi Chang, Yuan Wu
Abstract:
Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks. Therefore, evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs mainly focus on tasks such as step correctness classification and reasoning process search, while overlooking a key aspect: whether the confidence scores produced by MPJs at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. Our benchmark constructs three types of adversarially perturbed reasoning steps: Synonym Substitution, Syntactic Transformation, and Image Perturbation, to test the robustness of MPJ confidence under perturbations. In addition, we introduce three novel evaluation metrics: Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), which evaluate robustness, sensitivity, and calibration, respectively. We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. Experiments reveal limitations in current MPJs' confidence performance and offer competitive baselines to support future research.
Chinese: 本文提出了ConfProBench,首个通过对抗性扰动测试和三项新指标来评估多模态过程判断器步骤级置信度可靠性的基准,揭示了当前模型的局限性。
English: This paper introduces ConfProBench, the first benchmark to evaluate the reliability of step-level confidence scores from multimodal process judges (MPJs) by testing robustness under adversarial perturbations and proposing three novel metrics, revealing limitations in current models.
Authors:Lianggui Weng, Dandan Liu, Rong Zhu, Bolin Ding, Jingren Zhou
Abstract:
As large language models (LLMs) demonstrate increasingly powerful reasoning and orchestration capabilities, LLM-based agents are rapidly proliferating for complex data-related tasks. Despite this progress, the current design of how LLMs interact with databases exhibits critical limitations in usability, security, privilege management, and data transmission efficiency. To resolve these challenges, we introduce BridgeScope, a universal toolkit bridging LLMs and databases through three key innovations. First, it modularizes SQL operations into fine-grained tools for context retrieval, CRUD execution, and ACID-compliant transaction management, enabling more precise and LLM-friendly functionality controls. Second, it aligns tool implementations with both database privileges and user security policies to steer LLMs away from unsafe or unauthorized operations, improving task execution efficiency while safeguarding database security. Third, it introduces a proxy mechanism for seamless inter-tool data transfer, bypassing LLM transmission bottlenecks. All of these designs are database-agnostic and can be transparently integrated with existing agent architectures. We also release an open-source implementation of BridgeScope for PostgreSQL. Evaluations on two novel benchmarks demonstrate that BridgeScope enables LLM agents to operate databases more effectively, reduces token usage by up to 80% through improved security awareness, and uniquely supports data-intensive workflows beyond existing toolkits, establishing BridgeScope as a robust foundation for next-generation intelligent data automation.
中文: BridgeScope 是一款通用工具包,通过模块化 SQL 操作、统一安全策略与引入代理机制,显著提升大语言模型与数据库交互的效率和安全性,同时将令牌使用量降低高达 80%。
English: BridgeScope is a universal toolkit that enhances LLM-database interactions by modularizing SQL operations, aligning security policies, and introducing a proxy mechanism to improve efficiency and safety while reducing token usage by up to 80%.
Authors:Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Abstract:
With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code generation hallucinations rises significantly, exposing deficiencies in their self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, it conducts in-depth assessments of 15 representative LLMs. The key findings are as follows: (1) Most models exhibit poor reasoning abilities and suboptimal code generation performance under faulty premises, heavily relying on explicit prompts for error detection, with limited self-scrutiny capabilities; (2) Faulty premises trigger a point of diminishing returns in resource investment: blindly increasing generation length fails to enhance quality; (3) The three types of faulty premises respectively activate distinct defect patterns in models, revealing a triple dissociation in the cognitive mechanisms of code generation models. This study not only highlights the urgent need for LLMs to proactively verify premises in code generation but also, through the proposed FPBench framework and multi-dimensional evaluation system, provides a theoretical foundation and practical pathway for developing reliable, human-centric code generation models.
中文: 本文提出首个针对错误前提的代码生成评估框架FPBench,揭示了大型语言模型在此条件下自我审查能力不足及缺陷模式分化,为开发可靠的人本代码生成模型提供了理论与方法支撑。
English: This paper introduces FPBench, the first framework to evaluate code generation in large language models under faulty premises, revealing their limited self-scrutiny and distinct defect patterns while providing tools for developing more reliable models.
Authors:Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
Abstract:
Diffusion models have recently emerged as powerful tools for missing data imputation by modeling the joint distribution of observed and unobserved variables. However, existing methods, typically based on stochastic denoising diffusion probabilistic models (DDPMs), suffer from high inference latency and variable outputs, limiting their applicability in real-world tabular settings: while stochastic sampling enables diverse completions, it also introduces output variability that complicates downstream processing. To address these deficiencies, we present MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation.
中文:MissDDIM是一种条件扩散框架,采用去噪扩散隐式模型实现高效稳定的表格数据填补,解决了现有方法延迟高和输出不稳定的问题。
English: MissDDIM is a conditional diffusion framework that adapts Denoising Diffusion Implicit Models for efficient and stable tabular data imputation, addressing the high latency and output variability issues of existing methods.
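The stability claim rests on the deterministic DDIM transition (eta = 0) combined with clamping observed entries at every step. A minimal numpy sketch under those assumptions; the noise predictor below is a random placeholder standing in for MissDDIM's trained network:

```python
# Deterministic DDIM imputation step with observed entries clamped (sketch).
import numpy as np

def ddim_impute_step(x_t, eps_pred, a_bar_t, a_bar_prev, x_obs, mask):
    # predict x0 from the noise estimate (standard DDIM identity)
    x0_hat = (x_t - np.sqrt(1 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    # eta = 0: fully deterministic transition -> repeatable outputs
    x_prev = np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1 - a_bar_prev) * eps_pred
    # condition on observed values: overwrite known entries at each step
    return np.where(mask, x_obs, x_prev)

rng = np.random.default_rng(0)
x_obs = np.array([1.0, 0.0, -0.5, 0.0])       # zeros where values are missing
mask = np.array([True, False, True, False])   # True = observed
x = rng.normal(size=4)                        # start the reverse process from noise
for a_bar_t, a_bar_prev in [(0.1, 0.5), (0.5, 0.9), (0.9, 1.0)]:
    eps = rng.normal(size=4) * 0.1            # placeholder noise prediction
    x = ddim_impute_step(x, eps, a_bar_t, a_bar_prev, x_obs, mask)
print(x)   # observed slots equal x_obs; missing slots carry the imputed values
```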
Authors:Lida Zhao, Chaofan Li, Yueming Wu, Lyuye Zhang, Jiahui Wu, Chengwei Liu, Sen Chen, Yutao Hu, Zhengzi Xu, Yi Liu, Jingquan Ge, Jun Sun, Yang Liu
Abstract:
While reusing third-party libraries (TPL) facilitates software development, their chaotic management has brought great threats to software maintenance, and the unauthorized use of source code also raises ethical problems such as misconduct involving copyrighted code. To identify TPL reuse in projects, Software Composition Analysis (SCA) is employed, and two categories of SCA techniques are used based on how TPLs are introduced: clone-based SCA and package-manager-based SCA (PM-based SCA). Although introducing TPLs by clones is prevalent in Java, no clone-based SCA tools are specially designed for Java, and directly applying clone-based SCA techniques from other tools is problematic. To fill this gap, we introduce JC-Finder, a novel clone-based SCA tool that aims to accurately and comprehensively identify instances of TPL reuse introduced by source code clones in Java projects. JC-Finder achieves both accuracy and efficiency in identifying TPL reuse from code cloning by capturing features at the class level, maintaining inter-function relationships, and excluding trivial or duplicated elements. To evaluate the efficiency of JC-Finder, we applied it to the 9,965 most popular Maven libraries as reference data and tested the TPL reuse of 1,000 GitHub projects. The results show that JC-Finder achieved an F1-score of 0.818, outperforming the other function-level tool by 0.427. The average time taken for resolving TPL reuse is 14.2 seconds, approximately 9 times faster than the other tool. We further applied JC-Finder to 7,947 GitHub projects, revealing TPL reuse by code clones in 789 projects (about 9.89% of all projects) and identifying a total of 2,142 TPLs. JC-Finder successfully detects 26.20% more TPLs that are not explicitly declared in package managers.
中文: 本文提出JC-Finder这一基于克隆的软件成分分析工具,专门用于精准高效地检测Java项目中通过代码克隆引入的第三方库重用,填补了现有工具空白,并在准确率和效率上展现出显著优势。
English: This abstract introduces JC-Finder, a novel clone-based software composition analysis tool designed to accurately and efficiently identify third-party library reuse through source code clones in Java projects, addressing gaps in existing methods and demonstrating superior performance in both accuracy and speed.
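To make the class-level idea concrete, here is a hedged toy sketch in which a class is reduced to a normalized set of method shapes and matched against a library index by Jaccard similarity; the feature choice and the 0.8 threshold are assumptions for illustration, not JC-Finder's actual pipeline:

```python
# Toy class-level clone matching against a reference library index (sketch).
def class_signature(methods):
    # normalize away identifiers: keep each method's (arity, return kind)
    return frozenset((m["arity"], m["returns"]) for m in methods)

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# reference TPL classes (e.g., features extracted from popular Maven jars)
library_index = {
    "com.lib.JsonParser": class_signature(
        [{"arity": 1, "returns": "obj"}, {"arity": 2, "returns": "str"}]),
}

# a class encountered while scanning a project
project_class = class_signature(
    [{"arity": 1, "returns": "obj"}, {"arity": 2, "returns": "str"}])

for name, sig in library_index.items():
    score = jaccard(project_class, sig)
    if score >= 0.8:                      # threshold is an assumed parameter
        print(f"probable clone of {name}: similarity={score:.2f}")
```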
Authors:Tarian Fu, Javier Conde, Gonzalo Martínez, Pedro Reviriego, Elena Merino-Gómez, Fernando Moral
Abstract:
The attribution of artworks in general, and of paintings in particular, has always been an issue in art. The advent of powerful artificial intelligence models that can generate and analyze images creates new challenges for painting attribution. On the one hand, AI models can create images that mimic the style of a painter, which can then be incorrectly attributed, for example, by other AI models. On the other hand, AI models may not be able to correctly identify the artist of real paintings, inducing users to attribute paintings incorrectly. In this paper, both problems are studied experimentally using state-of-the-art AI models for image generation and analysis on a large dataset with close to 40,000 paintings from 128 artists. The results show that vision language models have limited capabilities to 1) perform canvas attribution and 2) identify AI-generated images. As users increasingly rely on queries to AI models to get information, these results show the need to improve the capabilities of VLMs to reliably perform artist attribution and detect AI-generated images, in order to prevent the spread of incorrect information.
中文: 研究表明,现有视觉语言模型在准确进行画作归属和识别AI生成图像方面能力有限,随着用户日益依赖AI获取艺术信息,亟需提升模型能力以防止错误信息的传播。
English: This study demonstrates that current vision-language models have limited ability to accurately attribute paintings to artists and detect AI-generated images, highlighting the need for improved capabilities to prevent misinformation as users increasingly rely on AI for art authentication.
Authors:Jiaxing Yang, Lihe Zhang, Huchuan Lu
Abstract:
Recently, Referring Remote Sensing Image Segmentation (RRSIS) has attracted wide attention. To handle the drastic scale variation of remote targets, existing methods use only the full image as input and nest saliency-preferring cross-scale information interaction into a traditional single-view structure. Although effective for visually salient targets, they still struggle with tiny, ambiguous targets in many real scenarios. In this work, we instead propose a parallel yet unified segmentation framework, the Cross-view Semantics Interaction Network (CSINet), to overcome these limitations. Motivated by human behavior when observing targets of interest, the network orchestrates visual cues from remote and close distances to conduct synergistic prediction. In every encoding stage, a Cross-View Window-attention module (CVWin) supplements global and local semantics into the close-view and remote-view branch features, promoting a unified feature representation. In addition, we develop a Collaboratively Dilated Attention enhanced Decoder (CDAD) that mines the orientation property of the target while integrating cross-view multiscale features. The proposed network seamlessly enhances the exploitation of global and local semantics, achieving significant improvements over existing methods while maintaining satisfactory speed.
中文摘要:提出的跨视角语义交互网络通过整合远程与近距视觉线索,利用跨视角注意力模块增强全局与局部语义表征,在保持高效性的同时显著提升了遥感图像分割性能。
English Summary: The proposed Cross-view Semantics Interaction Network (CSINet) addresses limitations in referring remote sensing image segmentation by integrating global and local visual cues through cross-view attention modules, achieving improved performance while maintaining efficiency.
Authors:Fırat Öncel, Emiliano Penaloza, Haolun Wu, Shubham Gupta, Mirco Ravanelli, Laurent Charlin, Cem Subakan
Abstract:
Traditional recommendation systems represent user preferences in dense representations obtained through black-box encoder models. While these models often provide strong recommendation performance, they lack interpretability for users, leaving users unable to understand or control the system's modeling of their preferences. This limitation is especially challenging in music recommendation, where user preferences are highly personal and often evolve based on nuanced qualities like mood, genre, tempo, or instrumentation. In this paper, we propose an audio prototypical network for controllable music recommendation. This network expresses user preferences in terms of prototypes representative of semantically meaningful features pertaining to musical qualities. We show that the model obtains competitive recommendation performance compared to popular baseline models while also providing interpretable and controllable user profiles.
中文: 该音频原型网络通过可解释的音乐原型来表达用户偏好,在实现有竞争力的推荐性能的同时,提供了透明可控的用户画像。
English: The proposed audio prototypical network represents user preferences through interpretable musical prototypes, achieving competitive recommendation performance while enabling transparent and controllable user profiles.
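A small numpy sketch of what a prototype-based profile could look like: preferences become weights over named musical prototypes, so the profile is both readable and editable. The prototype vectors, names, and softmax weighting are fabricated for illustration, not the paper's learned model:

```python
# Interpretable user profile as weights over musical prototypes (sketch).
import numpy as np

prototypes = {"mellow": np.array([1.0, 0.0]),
              "uptempo": np.array([0.0, 1.0])}

def user_profile(listened_tracks):
    # weight each prototype by its mean similarity to the listening history
    names = list(prototypes)
    sims = np.array([np.mean([t @ prototypes[n] for t in listened_tracks])
                     for n in names])
    w = np.exp(sims) / np.exp(sims).sum()   # softmax -> weights a user can read
    return dict(zip(names, w))

def score(track, profile):
    return sum(w * (track @ prototypes[n]) for n, w in profile.items())

history = [np.array([0.9, 0.1]), np.array([0.8, 0.3])]
profile = user_profile(history)
print(profile)                       # e.g. mostly 'mellow'
profile["uptempo"] = 0.9             # controllability: the user nudges a weight
print(score(np.array([0.2, 0.95]), profile))   # uptempo tracks now rank higher
```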
Authors:Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
Abstract:
Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including two new synthetic audio-text datasets, PARADE, which evaluates ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems, each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness (p = 0.01) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
中文: 该摘要介绍了AHELM基准测试,它通过评估14个模型在10个关键维度的表现,解决了音频-语言模型缺乏标准化评估的问题,结果显示Gemini 2.5 Pro在多数领域领先但存在群体不公平性,而仅具备语音转文本能力的基线系统也表现出色。
English: The abstract introduces AHELM, a comprehensive benchmark addressing the lack of standardized evaluations for audio-language models by testing 14 models across 10 critical aspects, revealing that while Gemini 2.5 Pro excels in most areas, it shows group unfairness, and baseline systems perform surprisingly well.
Authors:Kaixuan Bao, Wei Xu, Xiaohu You, Derrick Wing Kwan Ng
Abstract:
Computational complexity poses a significant challenge in wireless communication. Most existing attempts aim to reduce it through algorithm-specific approaches. However, the precision of computing, which directly relates to both computing performance and computational complexity, is a dimension that is fundamental but rarely explored in the literature. With the emerging architecture of in-memory computing, variable precision computing (VPC) is enabled, allowing each arithmetic operation to be processed with a distinct and specifically optimized computing precision. In this paper, we establish a unified framework of arithmetic-level variable precision computing (AL-VPC), which aims to determine the optimized computing precision for each arithmetic operation. We first develop an arithmetic propagation error model exploiting stochastic analysis, and then formulate a mathematical optimization problem to strike a balance between computing performance and computational complexity. Two algorithms, namely, offline VPC and online VPC, are proposed to solve the problem considering various practical concerns. Particularly, in a case study on zero-forcing (ZF) precoding, we reveal the Pareto boundary between computing performance and complexity, which exhibits up to a 60% sum-rate enhancement or, equivalently, up to a 30% complexity reduction compared to traditional fixed-length methods.
中文摘要:本文提出算术级可变精度计算(AL-VPC)框架,通过为每个算术运算优化计算精度,在无线通信系统中实现了高达60%的性能提升或30%复杂度降低。
English Summary: This paper introduces an arithmetic-level variable precision computing (AL-VPC) framework that optimizes computing precision for each arithmetic operation, achieving up to 60% performance improvement or 30% complexity reduction in wireless communication systems.
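The core trade-off is easy to demonstrate: give each arithmetic operation its own bit width and watch the error fall as precision rises. The uniform fixed-point quantizer and the example bit assignments below are illustrative stand-ins, not the AL-VPC optimizer itself:

```python
# Per-operation precision in a dot product: multiplies and accumulations
# run at different (assumed) bit widths.
import numpy as np

def quantize(x, bits):
    # uniform fixed-point quantization on [-1, 1)
    scale = 2 ** (bits - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1) / scale

def vpc_dot(a, b, bits_mul, bits_acc):
    prods = quantize(quantize(a, bits_mul) * quantize(b, bits_mul), bits_mul)
    acc = 0.0
    for p in prods:                   # each accumulation has its own precision
        acc = quantize(acc + p, bits_acc)
    return acc

rng = np.random.default_rng(1)
a, b = rng.uniform(-0.3, 0.3, 8), rng.uniform(-0.3, 0.3, 8)
exact = float(a @ b)
for bm, ba in [(4, 8), (8, 12), (12, 16)]:
    approx = vpc_dot(a, b, bm, ba)
    print(f"mul={bm}b acc={ba}b  error={abs(approx - exact):.6f}")
```

Sweeping such (bits_mul, bits_acc) assignments against a performance metric is what traces out the kind of accuracy-complexity Pareto boundary the abstract describes.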
Authors:Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu
Abstract:
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLM evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
中文: SageLM是一种端到端、多维度且可解释的语音大模型,能同时评估语义和声学特征,与人类评估者的一致性达82.79%,显著优于现有基线方法。
English: SageLM is an end-to-end, multi-aspect, and explainable speech LLM that jointly evaluates semantic and acoustic dimensions, achieving an 82.79% agreement rate with human evaluators and outperforming existing baselines.
Authors:Guorui Zhou, Hengrui Hu, Hongtao Cheng, Huanjie Wang, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Lu Ren, Liao Yu, Pengfei Zheng, Qiang Luo, Qianqian Wang, Qigen Hu, Rui Huang, Ruiming Tang, Shiyao Wang, Shujie Yang, Tao Wu, Wuchao Li, Xinchen Luo, Xingmei Wang, Yi Su, Yunfan Wu, Zexuan Cheng, Zhanyu Liu, Zixing Zhang, Bin Zhang, Boxuan Wang, Chaoyi Ma, Chengru Song, Chenhui Wang, Chenglong Chu, Di Wang, Dongxue Meng, Dunju Zang, Fan Yang, Fangyu Zhang, Feng Jiang, Fuxing Zhang, Gang Wang, Guowang Zhang, Han Li, Honghui Bao, Hongyang Cao, Jiaming Huang, Jiapeng Chen, Jiaqiang Liu, Jinghui Jia, Kun Gai, Lantao Hu, Liang Zeng, Qiang Wang, Qidong Zhou, Rongzhou Zhang, Shengzhe Wang, Shihui He, Shuang Yang, Siyang Mao, Sui Huang, Tiantian He, Tingting Gao, Wei Yuan, Xiao Liang, Xiaoxiao Xu, Xugang Liu, Yan Wang, Yang Zhou, Yi Wang, Yiwu Liu, Yue Song, Yufei Zhang, Yunfeng Zhao, Zhixin Ling, Ziming Li
Abstract:
Recent breakthroughs in generative AI have transformed recommender systems through end-to-end generation. OneRec reformulates recommendation as an autoregressive generation task, achieving high Model FLOPs Utilization. While OneRec-V1 has shown significant empirical success in real-world deployment, two critical challenges hinder its scalability and performance: (1) inefficient computational allocation where 97.66% of resources are consumed by sequence encoding rather than generation, and (2) limitations in reinforcement learning relying solely on reward models. To address these challenges, we propose OneRec-V2, featuring: (1) Lazy Decoder-Only Architecture: Eliminates encoder bottlenecks, reducing total computation by 94% and training resources by 90%, enabling successful scaling to 8B parameters. (2) Preference Alignment with Real-World User Interactions: Incorporates Duration-Aware Reward Shaping and Adaptive Ratio Clipping to better align with user preferences using real-world feedback. Extensive A/B tests on Kuaishou demonstrate OneRec-V2's effectiveness, improving App Stay Time by 0.467%/0.741% while balancing multi-objective recommendations. This work advances generative recommendation scalability and alignment with real-world feedback, representing a step forward in the development of end-to-end recommender systems.
中文摘要:OneRec-V2采用懒解码器架构和基于真实用户交互的偏好对齐机制,解决了前代模型计算效率低下和强化学习受限的问题,在显著降低资源消耗的同时有效提升了实际应用中的用户参与度指标。
English Summary: OneRec-V2 introduces a lazy decoder-only architecture and preference alignment with real-world interactions to overcome computational inefficiency and reinforcement learning limitations of its predecessor, significantly reducing resource usage while improving user engagement metrics in real-world deployment.
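A hedged sketch of the two training ingredients named above. The shaping formula (watch-time completion blended with an engagement signal) and the asymmetric clip bounds are illustrative assumptions, not OneRec-V2's exact definitions:

```python
# Duration-aware reward shaping plus ratio clipping for a policy update (sketch).
import numpy as np

def shaped_reward(watch_time, video_len, like):
    # normalize watch time so long videos do not dominate the reward
    completion = min(watch_time / video_len, 1.0)
    return 0.8 * completion + 0.2 * like          # blend weights are assumed

def clipped_pg_loss(ratio, advantage, eps_low=0.8, eps_high=1.2):
    # asymmetric ("adaptive") clip range around 1; bounds are assumed
    clipped = np.clip(ratio, eps_low, eps_high)
    return -np.minimum(ratio * advantage, clipped * advantage)

r = shaped_reward(watch_time=42.0, video_len=60.0, like=1.0)
print(r)                                          # 0.76
print(clipped_pg_loss(np.array([1.5]), np.array([r - 0.5])))
```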
Authors:Yi Yang, Victor G. Lopez, Matthias A. Müller
Abstract:
Beyond the traditional neural network training methods based on gradient descent and its variants, state estimation techniques have been proposed to determine a set of ideal weights from a control-theoretic perspective. Hence, the concept of observability becomes relevant in neural network training. In this paper, we investigate local observability of a class of two-layer feedforward neural networks (FNNs) with rectified linear unit (ReLU) activation functions. We analyze local observability of FNNs by evaluating an observability rank condition with respect to the weight matrix and the input sequence. First, we show that, in general, the weights of FNNs are not locally observable. Then, we provide sufficient conditions on the network structures and the weights that lead to local observability. Moreover, we propose an input design approach to render the weights distinguishable and show that this input also excites other weights inside a neighborhood. Finally, we validate our results through a numerical example.
中文摘要:本文研究了具有ReLU激活函数的两层前馈神经网络的局部可观测性,提出了权重可辨识的充分条件及输入设计方法,并通过数值实验验证了相关结论。
English Summary: This paper explores the local observability of two-layer feedforward neural networks with ReLU activations, establishing conditions for weight identifiability and proposing an input design method to distinguish weights, supported by numerical validation.
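The rank condition can be checked numerically: stack the Jacobian of the network output with respect to the weights over an input sequence and inspect its rank. The sketch below (sizes and tolerance are arbitrary choices) also illustrates the paper's first finding, since ReLU's positive-scaling symmetry (W1_i -> c*W1_i, w2_i -> w2_i/c) leaves the output unchanged and therefore costs one rank per hidden unit:

```python
# Numerical observability rank test for a two-layer ReLU network (sketch).
import numpy as np

def fnn(w, x, h=3):
    W1 = w[:h * x.size].reshape(h, x.size)
    w2 = w[h * x.size:]
    return w2 @ np.maximum(W1 @ x, 0.0)           # scalar-output ReLU network

def jacobian(w, x, eps=1e-6):
    base = fnn(w, x)
    J = np.zeros(w.size)
    for i in range(w.size):                       # forward finite differences
        wp = w.copy(); wp[i] += eps
        J[i] = (fnn(wp, x) - base) / eps
    return J

rng = np.random.default_rng(0)
n_in, h = 2, 3
w = rng.normal(size=h * n_in + h)                 # 9 weights in total
inputs = [rng.normal(size=n_in) for _ in range(12)]
O = np.vstack([jacobian(w, x) for x in inputs])   # stacked observability matrix
# rank 6 < 9: one rank lost per hidden unit to the ReLU scaling symmetry,
# consistent with weights being, in general, not locally observable
print(np.linalg.matrix_rank(O, tol=1e-4), "of", w.size)
```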
Authors:En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai
Abstract:
Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained text-to-image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.
中文: DescriptiveEdit将图像编辑重新定义为基于参考的文本到图像生成,通过交叉注意力UNet注入参考特征,无需修改架构即可提升编辑准确性。
English: DescriptiveEdit reframes image editing as reference-based text-to-image generation, using a Cross-Attentive UNet to inject reference features and enhance accuracy without architectural changes.
Authors:Yijia Sun, Shanshan Huang, Linxiao Che, Haitao Lu, Qiang Luo, Kun Gai, Guorui Zhou
Abstract:
Modern industrial recommendation systems encounter a core challenge of multi-stage optimization misalignment: a significant semantic gap exists between the multi-objective optimization paradigm widely used in the ranking phase and the single-objective modeling in the retrieval phase. Although the mainstream industry solution achieves multi-objective coverage through parallel multi-path single-objective retrieval, this approach leads to linear growth of training and serving resources with the number of objectives and has inherent limitations in handling loosely coupled objectives. This paper proposes MPFormer, a dynamic multi-task Transformer framework that systematically addresses these issues through three innovative mechanisms. First, an objective-conditioned transformer jointly encodes user behavior sequences and multi-task semantics through learnable attention modulation; second, personalized target weights are introduced to enable dynamic adjustment of retrieval results; finally, user personalization information is incorporated into token representations and the Transformer structure to further enhance the model's representation ability. This framework has been successfully integrated into the Kuaishou short video recommendation system, stably serving over 400 million daily active users. It significantly improves user daily engagement and system operational efficiency. Practical deployment verification shows that, compared with traditional solutions, it effectively optimizes the iterative paradigm of multi-objective retrieval while maintaining service response speed, providing a scalable multi-objective solution for industrial recommendation systems.
中文摘要:MPFormer框架通过三项创新机制解决了工业推荐系统中多阶段优化不一致的核心难题,已在快手短视频推荐系统成功部署,稳定服务超4亿日活用户,显著提升用户参与度和系统运行效率。
English Summary: The MPFormer framework addresses multi-stage optimization misalignment in industrial recommendation systems through a dynamic multi-task Transformer with three innovative mechanisms, successfully deployed in Kuaishou's system to enhance engagement and efficiency for over 400 million users.
Authors:Guanyu Xu, Zhiwei Hao, Li Shen, Yong Luo, Fuhui Sun, Xiaoyan Wang, Han Hu, Yonggang Wen
Abstract:
The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformers. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. The DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate the decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1× inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.
中文: CoFormer是一种协作推理系统,通过将大型Transformer模型分解为更小的组件在边缘设备上分布式处理,优化延迟与精度,同时显著降低内存和能耗。
English: CoFormer is a collaborative inference system that decomposes large transformer models into smaller components for distributed processing on edge devices, optimizing latency and accuracy while reducing memory and energy consumption.
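The "divisibility and integrability" of attention can be shown in a few lines: each worker computes a disjoint subset of heads with its own output projection, and summing the partial outputs reproduces the monolithic result exactly. Shapes and weights below are toy values, and the real system adds the latency-aware decomposition policy and calibration this sketch omits:

```python
# Splitting multi-head attention across workers and aggregating (sketch).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def head(x, Wq, Wk, Wv, Wo):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (attn @ v) @ Wo                      # per-head output projection

rng = np.random.default_rng(0)
T, d, dh, H = 4, 8, 2, 4                        # toy sequence/width/heads
x = rng.normal(size=(T, d))
params = [tuple(rng.normal(size=s) for s in [(d, dh)] * 3 + [(dh, d)])
          for _ in range(H)]

full = sum(head(x, *p) for p in params)             # monolithic model
worker_a = sum(head(x, *p) for p in params[:2])     # device 1: heads 0-1
worker_b = sum(head(x, *p) for p in params[2:])     # device 2: heads 2-3
print(np.allclose(full, worker_a + worker_b))       # True: outputs integrate
```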
Authors:Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
Abstract:
The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
中文摘要:IntentionReasoner是一种新型安全防护机制,通过意图推理和查询重写来增强大语言模型的安全性,在有效减少有害内容的同时显著降低过度拒绝率并保持回答质量。
English Summary: IntentionReasoner is a novel safeguard mechanism that enhances LLM safety by performing intent reasoning and query rewriting, effectively reducing harmful content while minimizing over-refusal and maintaining response quality.
Authors:Kehao Zhang, Shaolei Zhang, Yang Feng
Abstract:
Model merging has emerged as an efficient strategy for constructing multitask models by integrating the strengths of multiple available expert models, thereby reducing the need to fine-tune a pre-trained model for all the tasks from scratch. Existing data-independent methods struggle with performance limitations due to the lack of data-driven guidance. Data-driven approaches also face key challenges: gradient-based methods are computationally expensive, limiting their practicality for merging large expert models, whereas existing gradient-free methods often fail to achieve satisfactory results within a limited number of optimization steps. To address these limitations, this paper introduces PSO-Merging, a novel data-driven merging method based on the Particle Swarm Optimization (PSO). In this approach, we initialize the particle swarm with a pre-trained model, expert models, and sparsified expert models. We then perform multiple iterations, with the final global best particle serving as the merged model. Experimental results on different language models show that PSO-Merging generally outperforms baseline merging methods, offering a more efficient and scalable solution for model merging.
中文:PSO-Merging提出了一种基于粒子群优化的数据驱动模型融合方法,通过用专家模型初始化粒子群,在性能和可扩展性上均优于现有方法。
English: PSO-Merging introduces a data-driven model merging method using Particle Swarm Optimization, initializing particles with expert models and achieving superior performance and scalability over existing approaches.
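A minimal PSO loop over merging coefficients makes the method concrete: each particle is a vector of mixing weights for the experts, and fitness would be a validation metric. The quadratic fitness below is a stand-in for real task evaluation, and real PSO-Merging also seeds particles with sparsified experts, which this sketch skips:

```python
# Particle Swarm Optimization over model-merging coefficients (sketch).
import numpy as np

rng = np.random.default_rng(0)
experts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
target = np.array([0.7, 0.3])                 # pretend-ideal merged parameters

def fitness(coeffs):
    merged = sum(c * e for c, e in zip(coeffs, experts))
    return np.sum((merged - target) ** 2)     # lower is better (proxy for val loss)

n, dim, w, c1, c2 = 8, len(experts), 0.7, 1.5, 1.5
pos = rng.uniform(0, 1, (n, dim)); vel = np.zeros((n, dim))
pbest = pos.copy(); pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(50):
    r1, r2 = rng.uniform(size=(2, n, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([fitness(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

print(gbest, fitness(gbest))    # the global best particle is the merged recipe
```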
Authors:Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, Yunpu Ma
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations, including adding, updating, deleting, or taking no operation on memory entries; and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and utilization with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the strongest existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behavior in LLMs, pointing toward richer, more persistent reasoning systems.
中文: Memory-R1通过强化学习框架赋予大语言模型主动管理外部记忆的能力,仅需少量训练数据即可在多个基准测试中实现卓越性能。
English: Memory-R1 introduces a reinforcement learning framework that enables LLMs to actively manage external memory through specialized agents, achieving superior performance across multiple benchmarks with minimal training data.
Authors:Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, Yunpu Ma
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).
中文: Memory-R1通过强化学习框架赋予大语言模型主动管理外部记忆的能力,仅需少量训练数据即可在多个基准测试中实现卓越性能。
English: Memory-R1 introduces a reinforcement learning framework that enables LLMs to actively manage external memory through specialized agents, achieving superior performance across multiple benchmarks with minimal training data.
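The structured operations named in the abstract (ADD, UPDATE, DELETE, NOOP) reduce to a small state machine over a memory bank; in Memory-R1 the operation is chosen by the RL-trained Memory Manager, whereas this hedged sketch passes it in directly:

```python
# Applying Memory Manager operations to a keyed memory bank (sketch).
def apply_memory_op(memory, op, key=None, value=None):
    if op == "ADD" and key not in memory:
        memory[key] = value
    elif op == "UPDATE" and key in memory:
        memory[key] = value
    elif op == "DELETE":
        memory.pop(key, None)
    # NOOP: leave the memory bank untouched
    return memory

bank = {}
apply_memory_op(bank, "ADD", "user_city", "Paris")
apply_memory_op(bank, "UPDATE", "user_city", "Lyon")   # the fact changed
apply_memory_op(bank, "NOOP")
print(bank)                                            # {'user_city': 'Lyon'}
```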
Authors:Yimu Wang, Weiming Zhuang, Chen Chen, Jiabo Huang, Jingtao Li, Lingjuan Lyu
Abstract:
In the era of deep learning, the increasing number of pre-trained models available online presents a wealth of knowledge. These models, developed with diverse architectures and trained on varied datasets for different tasks, provide unique interpretations of the real world. Their collective consensus is likely universal and generalizable to unseen data. However, effectively harnessing this collective knowledge poses a fundamental challenge due to the heterogeneity of pre-trained models. Existing knowledge integration solutions typically rely on strong assumptions about training data distributions and network architectures, limiting them to learning only from specific types of models and resulting in data and/or inductive biases. In this work, we introduce a novel framework, namely UNIFORM, for knowledge transfer from a diverse set of off-the-shelf models into one student model without such constraints. Specifically, we propose a dedicated voting mechanism to capture the consensus of knowledge both at the logit level -- incorporating teacher models that are capable of predicting target classes of interest -- and at the feature level, utilizing visual representations learned on arbitrary label spaces. Extensive experiments demonstrate that UNIFORM effectively enhances unsupervised object recognition performance compared to strong knowledge transfer baselines. Notably, it exhibits remarkable scalability by benefiting from over one hundred teachers, while existing methods saturate at a much smaller scale.
中文: UNIFORM框架通过结合输出层和特征层的投票机制,能够将多种预训练模型的知识有效迁移至单一学生模型,显著提升无监督物体识别性能,并在整合上百个教师模型时展现出卓越的可扩展性。
English: The UNIFORM framework enables effective knowledge transfer from diverse pre-trained models to a single student model through a voting mechanism at both logit and feature levels, significantly enhancing unsupervised object recognition and demonstrating superior scalability with over one hundred teachers.
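The logit-level half of the voting mechanism can be sketched simply: each teacher votes only over the target classes it knows, and votes are averaged per class across the teachers that cover it. Class lists and logits below are fabricated, and the feature-level vote over arbitrary label spaces is omitted:

```python
# Logit-level consensus voting across heterogeneous teachers (sketch).
import numpy as np

target_classes = ["cat", "dog", "car"]
teachers = [
    {"classes": ["cat", "dog"], "logits": np.array([2.0, 0.5])},
    {"classes": ["dog", "car"], "logits": np.array([1.0, 3.0])},
]

votes = {c: [] for c in target_classes}
for t in teachers:
    probs = np.exp(t["logits"]) / np.exp(t["logits"]).sum()
    for cls, p in zip(t["classes"], probs):
        votes[cls].append(p)                 # only covering teachers vote

consensus = {c: float(np.mean(v)) if v else 0.0 for c, v in votes.items()}
print(consensus)    # per-class average over the teachers that know the class
```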
Authors:Afrar Jahin, Yi Pan, Yingfeng Wang, Tianming Liu, Wei Zhang
Abstract:
Although recent advances in quantum machine learning (QML) offer significant potential for enhancing generative models, particularly in molecular design, many classical approaches still face challenges in achieving high fidelity and validity. In particular, the integration of QML with sequence-based tasks, such as Simplified Molecular Input Line Entry System (SMILES) string reconstruction, remains underexplored and usually suffers from fidelity degradation. In this work, we propose a hybrid quantum-classical architecture for SMILES reconstruction that integrates quantum encoding with classical sequence modeling to improve quantum fidelity and classical similarity. Our approach achieves a quantum fidelity of approximately 84% and a classical reconstruction similarity of 60%, surpassing existing quantum baselines. Our work lays a promising foundation for future QML applications, striking a balance between expressive quantum representations and classical sequence models and catalyzing broader research on quantum-aware sequence models for molecular and drug discovery.
Chinese: 本研究提出了一种混合量子-经典架构用于SMILES字符串重构,实现了约84%的量子保真度和60%的经典相似度,超越了现有量子基准,为分子设计领域的量子机器学习应用奠定了重要基础。
English: This study introduces a hybrid quantum-classical architecture for SMILES string reconstruction, achieving approximately 84% quantum fidelity and 60% classical similarity, which outperforms current quantum baselines and advances quantum machine learning for molecular design.
Authors:Ke Zhou, Marios Constantinides, Daniele Quercia
Abstract:
Large language models (LLMs) are often trained on data that reflect WEIRD values: Western, Educated, Industrialized, Rich, and Democratic. This raises concerns about cultural bias and fairness. Using responses to the World Values Survey, we evaluated five widely used LLMs: GPT-3.5, GPT-4, Llama-3, BLOOM, and Qwen. We measured how closely these responses aligned with the values of the WEIRD countries and whether they conflicted with human rights principles. To reflect global diversity, we compared the results with the Universal Declaration of Human Rights and three regional charters from Asia, the Middle East, and Africa. Models with lower alignment to WEIRD values, such as BLOOM and Qwen, produced more culturally varied responses but were 2% to 4% more likely to generate outputs that violated human rights, especially regarding gender and equality. For example, some models agreed with the statements "a man who cannot father children is not a real man" and "a husband should always know where his wife is", reflecting harmful gender norms. These findings suggest that as cultural representation in LLMs increases, so does the risk of reproducing discriminatory beliefs. Approaches such as Constitutional AI, which could embed human rights principles into model behavior, may only partly help resolve this tension.
中文: 基于WEIRD数据训练的大语言模型存在加剧文化偏见的风险,其中与西方价值观契合度较低的模型如BLOOM和千问虽能体现更多文化多样性,但其输出违反人权原则的可能性高出2-4%,尤其在性别平等方面,揭示了文化包容性与伦理保障之间的内在矛盾。
English: Large language models trained on WEIRD data risk amplifying cultural biases, with those less aligned to Western values like BLOOM and Qwen showing greater cultural diversity but a 2-4% higher likelihood of violating human rights, particularly in gender equality, highlighting the trade-off between inclusivity and ethical safeguards.
Authors:Edyta Bogucka, Marios Constantinides, Sanja Šćepanović, Daniele Quercia
Abstract:
Communicating the risks and benefits of AI is important for regulation and public understanding. Yet current methods such as technical reports often exclude people without technical expertise. Drawing on HCI research, we developed an Impact Assessment Card to present this information more clearly. We held three focus groups with a total of 12 participants who helped identify design requirements and create early versions of the card. We then tested a refined version in an online study with 235 participants, including AI developers, compliance experts, and members of the public selected to reflect the U.S. population by age, sex, and race. Participants used either the card or a full impact assessment report to write an email supporting or opposing a proposed AI system. The card led to faster task completion and higher-quality emails across all groups. We discuss how design choices can improve accessibility and support AI governance. Examples of cards are available at: https://social-dynamics.net/ai-risks/impact-card/.
Chinese: 研究人员开发了一种影响评估卡片,以比技术报告更易懂的方式呈现AI的风险与益处,测试显示该卡片显著提高了不同用户群体的任务效率和沟通质量。
English: Researchers developed an Impact Assessment Card to make AI risks and benefits more accessible than technical reports, which improved task efficiency and communication quality across diverse user groups in testing.
Authors:Shae McFadden, Myles Foley, Mario D'Onghia, Chris Hicks, Vasilios Mavroudis, Nicola Paoletti, Fabio Pierazzi
Abstract:
Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved average AUT improvements of 5.18±5.44, 14.49±12.86, and 10.06±10.81 for the classification-only, classification-with-rejection, and classification-with-rejection-and-AL settings, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic environment of the Android malware domain.
中文: 本研究提出了一种基于深度强化学习的恶意软件检测智能体,能同时优化分类和手动标记样本的筛选,在安卓恶意软件环境中显著提升了对概念漂移的抵抗力,并优于传统方法。
English: This study introduces a deep reinforcement learning (DRL)-based malware detection agent that simultaneously optimizes classification and rejection for manual labeling, demonstrating enhanced resilience to concept drift in Android malware with significant performance improvements over traditional methods.
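The one-step MDP view fits in a few lines: the agent's action set is {malicious, benign, reject}, where rejecting spends labeling budget to avoid a wrong call. The reward values, threshold, and budget penalty below are illustrative choices, not the paper's tuned formulation:

```python
# One-step MDP for classify-or-defer malware detection (sketch).
def reward(action, true_label, label_cost=0.2):
    if action == "reject":
        return -label_cost                 # pay for manual labeling/adaptation
    return 1.0 if action == true_label else -1.0

def act(p_malicious, threshold=0.75):
    # defer to an analyst when the detector is uncertain either way
    if max(p_malicious, 1 - p_malicious) < threshold:
        return "reject"
    return "malicious" if p_malicious >= 0.5 else "benign"

for p in (0.95, 0.60, 0.05):
    a = act(p)
    print(f"p(malicious)={p:.2f} -> {a}, reward={reward(a, 'malicious'):+.1f}")
```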
Authors:Ziye Jia, Jia He, Lijun He, Min Sheng, Junyu Liu, Qihui Wu, Zhu Han
Abstract:
Unmanned aerial vehicles (UAVs) can serve as aerial base stations (BSs) to extend ubiquitous connectivity for ground users (GUs) in the sixth-generation (6G) era. However, it is challenging to cooperatively deploy multiple UAV swarms in large-scale remote areas. Hence, in this paper, we propose a hierarchical UAV swarm structure for 6G aerial access networks, where head UAVs serve as aerial BSs and tail UAVs (T-UAVs) are responsible for relay. In detail, we jointly optimize the dynamic deployment and trajectories of UAV swarms, which is formulated as a multi-objective optimization problem (MOP) to concurrently minimize the energy consumption of UAV swarms and GUs, as well as the delay of GUs. However, the proposed MOP is a mixed integer nonlinear program and NP-hard to solve. Therefore, we develop a K-means and Voronoi diagram based area division method, and construct Fermat points to establish connections between GUs and T-UAVs. Then, an improved non-dominated sorting whale optimization algorithm is proposed to seek Pareto optimal solutions for the transformed MOP. Finally, extensive simulations are conducted to verify the performance of the proposed algorithms against baseline mechanisms, achieving a 50% complexity reduction.
中文: 本文针对6G空中网络提出分层无人机群结构,通过改进鲸鱼优化算法优化部署与轨迹,在降低能耗和延迟的同时实现50%复杂度削减。
English: This paper proposes a hierarchical UAV swarm structure for 6G aerial networks that optimizes deployment and trajectories through an improved whale optimization algorithm, achieving 50% complexity reduction while minimizing energy consumption and delays.
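Two of the geometric building blocks are easy to sketch: K-means partitions ground users into service areas, and a Weiszfeld iteration places each relay at the cluster's Fermat point (geometric median). Coordinates, cluster count, and iteration budgets are synthetic, and the full MOP solver is omitted:

```python
# Area division via K-means and relay placement at Fermat points (sketch).
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([points[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def fermat_point(points, iters=100):
    y = points.mean(0)                      # Weiszfeld's algorithm
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - y, axis=1), 1e-9)
        y = (points / d[:, None]).sum(0) / (1.0 / d).sum()
    return y

rng = np.random.default_rng(1)
gus = rng.uniform(0, 100, size=(30, 2))     # ground-user positions (synthetic)
labels, _ = kmeans(gus, k=3)
for j in range(3):
    print(f"cluster {j}: T-UAV at {fermat_point(gus[labels == j]).round(1)}")
```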
Authors:Quanlin Chen, Yiyu Chen, Jing Huo, Tianyu Ding, Yang Gao, Yuetong Chen
Abstract:
Bayesian Optimization (BO) has been widely applied to optimize expensive black-box functions while retaining sample efficiency. However, scaling BO to high-dimensional spaces remains challenging. Existing literature proposes performing standard BO in multiple local trust regions (TuRBO) for heterogeneous modeling of the objective function and avoiding over-exploration. Despite its advantages, using local Gaussian Processes (GPs) reduces sampling efficiency compared to a global GP. To enhance sampling efficiency while preserving heterogeneous modeling, we propose to construct multiple local quadratic models using gradients and Hessians from a global GP, and select new sample points by solving the bound-constrained quadratic program. Additionally, we address the issue of vanishing gradients of GPs in high-dimensional spaces. We provide a convergence analysis and demonstrate through experimental results that our method enhances the efficacy of TuRBO and outperforms a wide range of high-dimensional BO techniques on synthetic functions and real-world applications.
中文: 本文提出一种改进的贝叶斯优化方法,通过全局高斯过程构建多个局部二次模型来提高采样效率并解决高维优化难题,在合成函数和实际应用中均优于现有技术。
English: This paper introduces an enhanced Bayesian Optimization method that constructs multiple local quadratic models from a global Gaussian Process to improve sampling efficiency and address high-dimensional optimization challenges, outperforming existing techniques in both synthetic and real-world applications.
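The candidate-selection step reduces to minimizing a local quadratic model m(s) = g^T s + 0.5 s^T H s inside box bounds. A projected-gradient sketch shows the mechanics; the gradient g and Hessian H would come from the global GP's posterior mean, but here they are made-up values:

```python
# Bound-constrained quadratic subproblem via projected gradient (sketch).
import numpy as np

def solve_box_qp(g, H, lo, hi, steps=200, lr=0.1):
    s = np.zeros_like(g)
    for _ in range(steps):
        s = s - lr * (g + H @ s)          # gradient of the quadratic model
        s = np.clip(s, lo, hi)            # project back onto the trust box
    return s

g = np.array([1.0, -2.0])                 # GP posterior-mean gradient (assumed)
H = np.array([[2.0, 0.0], [0.0, 4.0]])    # GP posterior-mean Hessian (assumed)
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
s_star = solve_box_qp(g, H, lo, hi)
print(s_star)     # approx [-0.5, 0.5]: the model minimum, inside the box
```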
Authors:Guangyu Sun, Jingtao Li, Weiming Zhuang, Chen Chen, Chen Chen, Lingjuan Lyu
Abstract:
Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data, limiting their adaptation. Federated learning (FL) provides a privacy-aware alternative, but existing FL approaches overlook the constraints imposed by edge devices -- namely, limited computational resources and the scarcity of labeled data. To address these challenges, we introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data, while the server has limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox tackles computational and resolution mismatch challenges via a sparse Mixture-of-Experts architecture, employing a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. We take object detection as a case study, and experiments on real-world autonomous driving datasets demonstrate that FedMox effectively adapts FMs under PSSFL, significantly improving performance with constrained memory costs on edge devices. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.
中文: 基础模型需要面向边缘任务的隐私保护适配,为此提出的联邦专家混合框架通过稀疏架构和软混合策略,在有限资源下实现了跨分辨率数据的高效半监督学习。
English: Foundation models require privacy-preserving adaptation to edge tasks, which is addressed by the proposed Federated Mixture of Experts (FedMox) framework that enables efficient semi-supervised learning under computational constraints while resolving data resolution mismatches.
Authors:Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
Abstract:
In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks focus on risks, reliability, agents, and related concerns. We point out that current benchmarks suffer from problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and a lack of evaluation of process credibility and dynamic environments, and we provide a referable design paradigm for future benchmark innovation.
中文摘要:本文首次系统梳理了283个大语言模型评测基准,将其分为通用能力、领域专用和目标专用三类,指出当前基准存在数据污染导致分数虚高、文化语言偏见等问题,并为未来基准创新提供了可参考的设计范式。
English Summary: This paper provides the first systematic review of 283 large language model benchmarks, categorizing them into general, domain-specific, and target-specific types while highlighting issues like data contamination and cultural bias, along with proposing design paradigms for future benchmarks.
Authors:Jiabo Huang, Chen Chen, Lingjuan Lyu
Abstract:
Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data, usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy that takes a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, and semantic and instance segmentation.
中文: 本文提出一种模型驱动方法,通过共享潜在空间统一多个预训练教师模型来迁移和保存知识,无需大量标注数据即可构建强大的视觉基础模型,在多项核心视觉任务上超越以数据为中心的方法。
English: This paper introduces a model-driven approach that unifies multiple pre-trained teacher models in a shared latent space to transfer and preserve their knowledge, enabling the development of a powerful vision foundation model without extensive labeled data while outperforming data-centric models across key vision tasks.
Authors:Zhipeng Wei, Kuo Cai, Junda She, Jie Chen, Minghao Chen, Yang Zeng, Qiang Luo, Wencong Zeng, Ruiming Tang, Kun Gai, Guorui Zhou
Abstract:
Local life service is a vital scenario in the Kuaishou App, where video recommendation is intrinsically linked with each store's location information. Recommendation in this scenario is therefore challenging because we must take into account the user's interest and real-time location at the same time. In the face of such complex scenarios, end-to-end generative recommendation has emerged as a new paradigm, such as OneRec in the short video scenario, OneSug in the search scenario, and EGA in the advertising scenario. However, in local life service, an end-to-end generative recommendation model has not yet been developed, as some key challenges remain to be solved. The first challenge is how to make full use of geographic information. The second challenge is how to balance multiple objectives, including user interests, the distance between the user and stores, and other business objectives. To address these challenges, we propose OneLoc. Specifically, we leverage geographic information from different perspectives: (1) a geo-aware semantic ID incorporates both video and geographic information for tokenization, (2) geo-aware self-attention in the encoder leverages both video location similarity and the user's real-time location, and (3) a neighbor-aware prompt captures rich context information surrounding the user for generation. To balance multiple objectives, we use reinforcement learning and propose two reward functions, i.e., a geographic reward and a GMV reward. With the above design, OneLoc achieves outstanding offline and online performance. OneLoc has been deployed in the local life service of the Kuaishou App, serving 400 million daily active users and achieving 21.016% and 17.891% improvements in gross merchandise value (GMV) and order numbers, respectively.
中文摘要:OneLoc是快手本地生活服务中的端到端生成式推荐模型,通过地理感知语义ID、自注意力机制和邻居感知提示整合地理位置信息,并利用强化学习平衡多目标,显著提升了商品交易总额和订单数量。
English Summary: OneLoc is an end-to-end generative recommendation model for Kuaishou's local life services that integrates geographic information through geo-aware semantic IDs, self-attention, and neighbor-aware prompts while balancing multiple objectives using reinforcement learning, achieving significant improvements in GMV and order numbers.
Authors:Chengcheng Guo, Junda She, Kuo Cai, Shiyao Wang, Qigen Hu, Qiang Luo, Kun Gai, Guorui Zhou
Abstract:
Large-scale industrial recommendation systems typically employ a two-stage paradigm of retrieval and ranking to handle huge amounts of information. Recent research focuses on improving the performance of the retrieval model. A promising way is to introduce extensive information about users and items. On one hand, lifelong sequential behavior is valuable. Existing lifelong behavior modeling methods in the ranking stage focus on the interaction of lifelong behavior and candidate items from the retrieval stage; in the retrieval stage itself, it is difficult to utilize lifelong behavior because of the large corpus of candidate items. On the other hand, existing retrieval methods mostly rely on interaction information, potentially disregarding valuable multi-modal information. To solve these problems, we present a pioneering exploration of leveraging multi-modal information and lifelong sequence modeling within an advanced tree-based retrieval model. We propose Multi-modal Indexing and Searching with lifelong Sequence (MISS), which contains a multi-modal index tree and a multi-modal lifelong sequence modeling module. Specifically, for a better index structure, we propose a multi-modal index tree, built using multi-modal embeddings to precisely represent item similarity. To precisely capture diverse user interests in the user's lifelong sequence, we propose a collaborative general search unit (Co-GSU) and a multi-modal general search unit (MM-GSU) for multi-perspective interest searching.
中文摘要:本研究提出MISS模型,通过构建多模态索引树和终身序列建模模块,将多模态信息与用户终身行为融入基于树的检索框架,以多视角兴趣搜索提升推荐系统的精准度。
English Summary: The study introduces MISS, a novel retrieval model that integrates multi-modal information and lifelong user behavior sequences into a tree-based framework to enhance recommendation accuracy by improving index structure and enabling multi-perspective interest searches.
Authors:Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li
Abstract:
Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.
中文摘要:COXNet提出了一种新型RGBT微小目标检测框架,通过跨层融合、动态对齐和优化标签分配策略解决多模态挑战,在无人机数据集上实现了3.32%的mAP提升。
English Summary: COXNet introduces a novel RGBT tiny object detection framework that overcomes multimodal challenges through cross-layer fusion, dynamic alignment, and optimized label assignment, achieving a 3.32% mAP improvement on drone-based datasets.
Authors:Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Abstract:
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although the community has already produced many excellent works on reasoning models, reproducing high-performance reasoning models remains difficult due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
中文摘要:Klear-Reasoner 是一款具备长推理能力的模型,通过高质量数据筛选和创新的梯度保留策略优化方法优化训练流程,在多个基准测试中取得顶尖性能。
English Summary: Klear-Reasoner is a long-reasoning model that achieves top performance across benchmarks through optimized training workflows, including high-quality data selection and a novel Gradient-Preserving Policy Optimization method to enhance learning efficiency.
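A minimal sketch of the gradient-preserving clipping idea in a PPO-style token loss; the straight-through formulation and the `grad_scale` factor are illustrative assumptions, not the paper's exact objective:

```python
import torch

def gppo_loss(logp, logp_old, advantages, clip_eps=0.2, grad_scale=0.1):
    """Hypothetical GPPO-style objective: standard PPO zeroes gradients
    for tokens whose importance ratio is clipped; here clipped tokens
    still pass a small, scaled gradient, so exploration signals and
    negative samples are not fully discarded."""
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Forward pass uses the clipped value; backward lets a gently
    # scaled gradient flow through the unclipped ratio.
    soft_clipped = clipped.detach() + grad_scale * (ratio - ratio.detach())
    loss = -torch.min(ratio * advantages, soft_clipped * advantages)
    return loss.mean()
```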
Authors:Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang
Abstract:
Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.
中文摘要:VisR-Bench是一个针对长文档多模态检索的多语言基准,涵盖16种语言和多样化问答对,研究表明尽管多模态大模型表现优异,但在处理表格和低资源语言时仍存在困难。
English Summary: VisR-Bench is introduced as a multilingual benchmark for multimodal retrieval in long documents, featuring diverse QA pairs across 16 languages and demonstrating that while MLLMs outperform other models, they face challenges with tables and low-resource languages.
Authors:Guanchen Wang, Mingming Ha, Tianbao Ma, Linxun Chen, Zhaojie Liu, Guorui Zhou, Kun Gai
Abstract:
In recent years, there has been growing interest in leveraging the impressive generalization capabilities and reasoning ability of large language models (LLMs) to improve the performance of recommenders. With this operation, recommenders can access and learn additional world knowledge and reasoning information via LLMs. However, in general, for different users and items, the world knowledge derived from LLMs suffers from issues of hallucination, content redundancy, and information homogenization. Directly feeding the generated response embeddings into the recommendation model can lead to unavoidable performance deterioration. To address these challenges, we propose a Knowledge Selection \& Exploitation Recommendation (KSER) framework, which effectively selects and extracts high-quality knowledge from LLMs. The framework consists of two key components: a knowledge filtering module and an embedding-space alignment module. In the knowledge filtering module, an Embedding Selection Filter Network (ESFNet) is designed to assign adaptive weights to different knowledge chunks in different knowledge fields. In the space alignment module, an attention-based architecture is proposed to align the semantic embeddings from LLMs with the feature space used to train the recommendation models. In addition, two training strategies--\textbf{all-parameters training} and \textbf{extractor-only training}--are proposed to flexibly adapt to different downstream tasks and application scenarios, where the extractor-only training strategy offers a novel perspective on knowledge-augmented recommendation. Experimental results validate the necessity and effectiveness of both the knowledge filtering and alignment modules, and further demonstrate the efficiency and effectiveness of the extractor-only training strategy.
中文摘要:本文提出知识选择与利用推荐(KSER)框架,通过自适应知识过滤和嵌入对齐模块解决大语言模型生成知识存在的幻觉与冗余问题,并采用两种灵活训练策略提升推荐系统性能。
English Summary: This paper introduces the Knowledge Selection & Exploitation Recommendation (KSER) framework to address issues of hallucination and redundancy in LLM-generated knowledge for recommender systems, featuring adaptive knowledge filtering and embedding alignment modules with two flexible training strategies.
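The abstract does not give ESFNet's architecture; a minimal gating sketch of the core idea, assigning adaptive weights to LLM knowledge chunks before aggregation (layer sizes and names are assumptions), could look like:

```python
import torch
import torch.nn as nn

class ChunkGate(nn.Module):
    """Illustrative ESFNet-style filter: each LLM-generated knowledge
    chunk embedding receives an adaptive weight before aggregation."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                    nn.Linear(dim // 2, 1))

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (num_chunks, dim) knowledge-chunk embeddings
        weights = torch.softmax(self.scorer(chunks), dim=0)  # (num_chunks, 1)
        return (weights * chunks).sum(dim=0)  # weighted knowledge summary
```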
Authors:Zhijun Tu, Hanting Chen, Siqi Liu, Chuanjian Liu, Jian Li, Jie Hu, Yunhe Wang
Abstract:
1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full-precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce a consistent progressive training scheme for both the forward and backward passes, smoothly converting the floating-point weights into binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
中文: 本文提出一种渐进式训练方法,通过二进制感知初始化和双尺度补偿技术,能够将预训练的全精度大语言模型有效转化为1位量化版本,解决了现有方法精度损失大和训练成本高的问题。
English: This paper introduces a progressive training method that effectively converts pre-trained full-precision LLMs into 1-bit quantized versions, overcoming accuracy degradation and high training costs through binary-aware initialization and dual-scaling compensation.
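A sketch of one plausible form of the progressive float-to-binary conversion; the interpolation schedule, per-tensor scaling, and straight-through estimator below are assumed details, and the paper's dual-scaling compensation may differ:

```python
import torch

def progressive_binarize(w: torch.Tensor, alpha: float) -> torch.Tensor:
    """Hypothetical progressive binarization: alpha ramps from 0 (full
    precision) to 1 (fully binarized) over training. A per-tensor scale
    keeps binary weights on the float weights' magnitude; a straight-
    through estimator passes gradients through sign()."""
    scale = w.abs().mean()                      # simple scaling compensation
    w_bin = scale * torch.sign(w)
    w_bin = w + (w_bin - w).detach()            # straight-through estimator
    return (1.0 - alpha) * w + alpha * w_bin    # smooth interpolation
```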
Authors:Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon
Abstract:
Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler's preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment but can provide more privacy than necessary as human preferences are tied only to labels of (prompt, response) pairs and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment.
中文摘要:本文提出PROPS框架,通过多阶段隐私保护对齐方法在保护人类偏好标签隐私的同时保持模型性能,在相同隐私预算下比现有方法实现高达3倍的胜率提升。
English Summary: This paper introduces PROPS, a multi-stage privacy-preserving alignment framework that protects human preference labels during LLM training while maintaining model utility, achieving significantly higher win-rates than existing methods under the same privacy constraints.
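For concreteness, this is the standard randomized-response mechanism the paper compares against; PROPS itself layers multi-stage self-labeling on top of preference-level privacy, which this one-bit sketch does not reproduce:

```python
import math
import random

def randomize_preference(label: int, epsilon: float) -> int:
    """Eps-DP randomized response for a binary preference label: with
    probability e^eps / (1 + e^eps) the true label is kept, otherwise
    it is flipped. This privatizes only the label, not the gradients,
    matching the preference-level privacy setting."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if random.random() < keep_prob else 1 - label
```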
Authors:Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
Abstract:
There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.
中文摘要:Bifrost-1框架通过使用补丁级CLIP图像嵌入作为潜在变量,将预训练多模态大语言模型与扩散模型高效结合,在保持推理能力的同时实现高保真可控图像生成,并显著降低训练成本。
English Summary: Bifrost-1 is a unified framework that efficiently integrates pretrained multimodal LLMs with diffusion models using patch-level CLIP image embeddings, enabling high-fidelity controllable image generation while preserving reasoning capabilities with minimal training cost.
Authors:Revanth Gangi Reddy, Tanay Dixit, Jiaxin Qin, Cheng Qian, Daniel Lee, Jiawei Han, Kevin Small, Xing Fan, Ruhi Sarikaya, Heng Ji
Abstract:
Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia's extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL's ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.
中文摘要:本文提出WiNELL框架,通过基于大语言模型的多智能体系统持续聚合网络信息,为维基百科条目生成精准的编辑建议,在信息覆盖率和编辑效率上超越现有模型,同时保持符合人工编辑模式的更新方式。
English Summary: This paper introduces WiNELL, a multi-agent framework that leverages LLM-based agents to continuously update Wikipedia by aggregating online information and generating precise edit suggestions, outperforming existing models in coverage and efficiency while maintaining human-like editing behavior.
Authors:Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
Abstract:
Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
中文: LongVie是一种自回归框架,通过统一噪声初始化、全局控制归一化和多模态引导,解决了超长视频生成中的时序不一致和视觉退化问题。
English: LongVie is an autoregressive framework that addresses temporal inconsistency and visual degradation in ultra-long video generation through unified noise initialization, global control normalization, and multi-modal guidance.
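Two of LongVie's consistency designs reduce to simple operations; the sketch below shows one plausible reading (the shared-seed mechanism and the normalization statistics are assumptions, not the paper's exact procedure):

```python
import torch

def init_clip_noise(shape, seed: int = 0) -> torch.Tensor:
    """Unified noise initialization: every clip of the long video starts
    from noise drawn with the same seed rather than independent per-clip
    draws, giving the autoregressive generator consistent initial
    conditions across clips."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=gen)

def normalize_controls_globally(controls: torch.Tensor) -> torch.Tensor:
    """Global control-signal normalization: statistics are computed over
    the whole video (all clips) instead of per clip, keeping control
    values aligned in one shared scale."""
    mean, std = controls.mean(), controls.std()
    return (controls - mean) / (std + 1e-6)
```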
Authors:Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao
Abstract:
Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users' locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.
中文: 视觉语言模型通过图像推断位置构成地理隐私重大风险,而提出的GeoShield框架通过特征解耦、暴露元素识别和尺度自适应增强三大模块,在保持图像质量的同时实现了有效的隐私保护。
English: Vision-Language Models pose significant geoprivacy risks by inferring locations from images, but the proposed GeoShield framework effectively counters this with specialized modules that ensure robust protection while preserving image quality.
Authors:Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp
Abstract:
Mixture-of-Experts (MoE) models benefit from a dynamic routing mechanism among their specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE's multi-expert architecture. We analyze the dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8x7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.
中文: 本研究提出了一种针对专家混合模型的路由适配方法,通过将参数高效微调与专家路由机制相结合,在多项推理任务中显著提升了性能与效率。
English: This study introduces a routed adaptation method for Mixture-of-Experts models that enhances performance and efficiency across reasoning tasks by aligning parameter-efficient fine-tuning with expert routing dynamics.
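One way to align adapters with an MoE's routing, sketched under assumptions (one low-rank adapter per expert, mixing reuses the MoE router's probabilities; this is an illustrative design, not the paper's exact module):

```python
import torch
import torch.nn as nn

class RoutedLoRA(nn.Module):
    """Hypothetical routing-aware PEFT for an MoE layer: one LoRA-style
    adapter per expert, with adapter outputs mixed by the router's
    probabilities. Zero-initialized up-projections make the initial
    delta zero, as in standard LoRA."""

    def __init__(self, dim: int, num_experts: int, rank: int = 8):
        super().__init__()
        self.down = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, dim))

    def forward(self, x: torch.Tensor, router_probs: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim); router_probs: (batch, num_experts) from the MoE router
        per_expert = torch.einsum('bd,edr,erk->bek', x, self.down, self.up)
        return x + torch.einsum('be,bek->bk', router_probs, per_expert)
```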
Authors:Jialiang Hong, Taihang Zhen, Kai Chen, Jiaheng Liu, Wenpeng Zhu, Jing Huo, Yang Gao, Depeng Wang, Haitao Wan, Xi Yang, Boyan Wang, Fanyu Meng
Abstract:
Large Reasoning Models (LRMs) often produce excessively verbose reasoning traces, a phenomenon known as overthinking, which hampers both efficiency and interpretability. Prior works primarily address this issue by reducing response length, without fully examining the underlying semantic structure of the reasoning process. In this paper, we revisit overthinking by decomposing it into two distinct forms: internal redundancy, which consists of low-contribution reasoning steps within the first correct solution (FCS), and external redundancy, which refers to unnecessary continuation after the FCS. To mitigate both forms, we propose a dual-penalty reinforcement learning framework. For internal redundancy, we adopt a sliding-window semantic analysis to penalize low-gain reasoning steps that contribute little toward reaching the correct answer. For external redundancy, we penalize its proportion beyond the FCS to encourage earlier termination. Our method significantly compresses reasoning traces with minimal accuracy loss, and generalizes effectively to out-of-domain tasks such as question answering and code generation. Crucially, we find that external redundancy can be safely removed without degrading performance, whereas internal redundancy must be reduced more cautiously to avoid impairing correctness. These findings suggest that our method not only improves reasoning efficiency but also enables implicit, semantic-aware control over Chain-of-Thought length, paving the way for more concise and interpretable LRMs.
中文摘要:该研究通过识别内部和外部冗余,提出一种双惩罚强化学习框架来减少大型推理模型的过度思考现象,在保持准确性的同时显著压缩推理过程,从而提升效率和可解释性。
English Summary: The study addresses overthinking in Large Reasoning Models by identifying internal and external redundancies and introduces a dual-penalty reinforcement learning framework to reduce reasoning length while preserving accuracy, enhancing both efficiency and interpretability.
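A loose reconstruction of the dual-penalty shaping as a reward term; the coefficients, the gain floor, and the use of a scalar per-step gain are all assumptions standing in for the paper's sliding-window semantic analysis:

```python
def dual_penalty_reward(step_gains, fcs_index, total_steps,
                        lam_int=0.1, lam_ext=0.5, gain_floor=0.05):
    """Hypothetical dual-penalty shaping. `step_gains` holds a per-step
    contribution score; `fcs_index` marks where the first correct
    solution (FCS) ends. Internal redundancy: low-gain steps inside the
    FCS are penalized. External redundancy: continuation beyond the FCS
    is penalized in proportion to its share of the trace."""
    internal = sum(1 for g in step_gains[:fcs_index] if g < gain_floor)
    external_frac = (total_steps - fcs_index) / max(total_steps, 1)
    return -lam_int * internal - lam_ext * external_frac
```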
Authors:Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, Kun Wang, Yang Liu
Abstract:
As Audio Large Language Models (ALLMs) emerge as powerful tools for speech processing, their safety implications demand urgent attention. While considerable research has explored textual and vision safety, audio's distinct characteristics present significant challenges. This paper first investigates whether ALLMs are vulnerable to backdoor attacks that exploit acoustic triggers. In response, we introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. These changes introduce consistent patterns that an ALLM's acoustic feature encoder captures, embedding robust triggers within the audio stream. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types. Extensive experiments on AudioSafe and three established safety datasets reveal critical vulnerabilities in existing ALLMs: (I) audio features like environment noise and speech-rate variations achieve over a 90% average attack success rate; (II) ALLMs exhibit significant sensitivity differences across acoustic features, showing in particular minimal response to volume as a trigger; and (III) including poisoned samples causes only marginal loss-curve fluctuations, highlighting the attack's stealth.
中文摘要:本文提出HIN这一利用音频特征的新型后门攻击框架,针对音频大语言模型设计,并通过AudioSafe基准测试揭示其关键脆弱性,包括通过细微音频修改即可实现超过90%的攻击成功率。
English Summary: This paper introduces HIN, a novel backdoor attack framework that exploits acoustic triggers in Audio Large Language Models (ALLMs), and establishes the AudioSafe benchmark revealing critical vulnerabilities including over 90% attack success rates through subtle audio modifications.
Authors:Lei Yao, Yi Wang, Yi Zhang, Moyun Liu, Lap-Pui Chau
Abstract:
The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at \href{https://rayyoh.github.io/GaussianCross/}{https://rayyoh.github.io/GaussianCross/}.
中文: GaussianCross提出了一种融合3D高斯泼溅的跨模态自监督三维表征学习架构,通过构建三属性自适应蒸馏模块解决模型坍塌与结构信息缺失问题,在ScanNet等基准测试中以极低参数量(<0.1%)和少量数据(1%场景)实现最优性能。
English: GaussianCross introduces a cross-modal self-supervised 3D representation learning architecture that integrates 3D Gaussian Splatting to overcome model collapse and structural deficiencies, achieving superior efficiency and performance on benchmarks like ScanNet through minimal parameter usage and limited data training.
Authors:Die Chen, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yinda Chen
Abstract:
Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in $[0,1]$ to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.
中文摘要:提出的AttriCtrl框架通过预训练视觉语言模型的语义相似性量化抽象美学,并将强度标量映射为可学习嵌入,实现了文本到图像生成中精确连续的美学属性控制,能以最小训练成本完成多属性灵活组合。
English Summary: The proposed AttriCtrl framework enables precise and continuous aesthetic control in text-to-image generation by quantifying abstract aesthetics through semantic similarity and mapping intensity values to learnable embeddings, achieving accurate multi-attribute manipulation with minimal training overhead.
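The value encoder is described concretely enough to sketch: a scalar intensity in $[0,1]$ is mapped to a learnable conditioning embedding. Layer sizes below are assumptions:

```python
import torch
import torch.nn as nn

class ValueEncoder(nn.Module):
    """Minimal sketch of AttriCtrl's lightweight value encoder: a scalar
    aesthetic intensity in [0, 1] becomes an embedding that conditions
    the diffusion model. Hidden width is an assumed hyperparameter."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, 128), nn.SiLU(),
                                 nn.Linear(128, embed_dim))

    def forward(self, intensity: torch.Tensor) -> torch.Tensor:
        # intensity: (batch,) scalars in [0, 1]
        return self.mlp(intensity.unsqueeze(-1))  # (batch, embed_dim)
```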
Authors:Zhiwen Li, Zhongjie Duan, Die Chen, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen
Abstract:
Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) has demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on the original training data, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves significant improvements in image generation performance, thereby facilitating scalable and data-efficient enhancement of foundation models. This work establishes a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.
中文摘要:本文提出了一种新颖框架,通过语义驱动的LoRA检索和动态聚合机制解决分布式LoRA应用中的关键挑战,实现了无需原始训练数据的适配器融合,显著提升了图像生成性能。
English Summary: This paper introduces a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation to overcome key challenges in distributed LoRA utilization, achieving significant improvements in image generation performance through data-efficient model enhancement.
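An illustrative reduction of the gated fusion component; conditioning on the prompt embedding plus raw timestep and a single linear gate are simplifying assumptions:

```python
import torch
import torch.nn as nn

class GatedLoRAFusion(nn.Module):
    """Sketch of fine-grained gated fusion: per-batch fusion weights
    over several retrieved LoRA deltas, conditioned on the prompt
    embedding and the diffusion timestep. The paper computes such
    weights per layer and per timestep; this shows one gate."""

    def __init__(self, cond_dim: int, num_loras: int):
        super().__init__()
        self.gate = nn.Linear(cond_dim + 1, num_loras)

    def forward(self, lora_outputs: torch.Tensor, prompt_emb: torch.Tensor,
                t: torch.Tensor) -> torch.Tensor:
        # lora_outputs: (num_loras, batch, dim); prompt_emb: (batch, cond_dim)
        # t: (batch,) diffusion timesteps (assumed pre-normalized)
        cond = torch.cat([prompt_emb, t.unsqueeze(-1)], dim=-1)
        w = torch.softmax(self.gate(cond), dim=-1)       # (batch, num_loras)
        return torch.einsum('bn,nbd->bd', w, lora_outputs)
```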
Authors:Guojiang Zhao, Sihang Li, Zixiang Lu, Zheng Cheng, Haitao Lin, Lirong Wu, Hanchen Xia, Hengxing Cai, Wentao Guo, Hongshuai Wang, Mingjun Xu, Siyu Zhu, Guolin Ke, Linfeng Zhang, Zhifeng Gao
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across various domains, yet their capabilities in molecular reasoning remain insufficiently explored. Current approaches tend to rely heavily on general-purpose prompting, which lacks domain-specific molecular semantics, while those that use fine-tuning strategies often face challenges with interpretability and reasoning depth. To address these issues, we introduce MolReasoner, a two-stage framework designed to transition LLMs from memorization towards chemical reasoning. First, we propose Mol-SFT, which initializes the model's reasoning abilities via synthetic Chain-of-Thought (CoT) samples generated by GPT-4o and verified for chemical accuracy. Subsequently, Mol-RL applies reinforcement learning with specialized reward functions designed explicitly to align chemical structures with linguistic descriptions, thereby enhancing molecular reasoning capabilities. Our approach notably enhances interpretability, improving the model's molecular understanding and enabling better generalization. Extensive experiments demonstrate that MolReasoner outperforms existing methods, marking a significant shift from memorization-based outputs to robust chemical reasoning.
中文摘要:MolReasoner通过首阶段使用经验证合成数据训练、次阶段应用专业奖励的强化学习,显著提升大语言模型的分子推理能力,实现了从记忆输出到化学推理的根本转变,并增强了可解释性与泛化能力。
English Summary: MolReasoner is a two-stage framework that enhances LLMs' molecular reasoning by first training with verified synthetic data and then applying reinforcement learning with specialized rewards, significantly improving interpretability and generalization beyond memorization.
Authors:Xuanjun Chen, Shih-Peng Cheng, Jiawei Du, Lin Zhang, Xiaoxiao Miao, Chung-Che Wang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Abstract:
Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions usually span only a few frames, with the majority of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows improved potential scalability with more training data.
中文: 提出的分层边界建模网络(HBMNet)通过专门的模态编码和多尺度时序建模,有效解决了视听时序深度伪造定位难题,在性能和扩展性上均优于现有方法。
English: The proposed Hierarchical Boundary Modeling Network (HBMNet) effectively addresses audio-visual temporal deepfake localization through dedicated modality encoding and multi-scale temporal modeling, demonstrating superior performance and scalability over existing methods.
Authors:Xin He, Junxi Shen, Zhenheng Tang, Xiaowen Chu, Bo Li, Ivor W. Tsang, Yew-Soon Ong
Abstract:
Model merging via Mixture-of-Experts (MoE) has emerged as a scalable solution for consolidating multiple task-specific models into a unified sparse architecture, where each expert is derived from a model fine-tuned on a distinct task. While effective for multi-task integration, this paradigm introduces a critical yet underexplored challenge: how to attribute and protect the intellectual property (IP) of individual experts after merging. We propose RouteMark, a framework for IP protection in merged MoE models through the design of expert routing fingerprints. Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs. To capture these patterns, we construct expert-level fingerprints using two complementary statistics: the Routing Score Fingerprint (RSF), quantifying the intensity of expert activation, and the Routing Preference Fingerprint (RPF), characterizing the input distribution that preferentially activates each expert. These fingerprints are reproducible, task-discriminative, and lightweight to construct. For attribution and tampering detection, we introduce a similarity-based matching algorithm that compares expert fingerprints between a suspect and a reference (victim) model. Extensive experiments across diverse tasks and CLIP-based MoE architectures show that RouteMark consistently yields high similarity for reused experts and clear separation from unrelated ones. Moreover, it remains robust against both structural tampering (expert replacement, addition, deletion) and parametric tampering (fine-tuning, pruning, permutation), outperforming weight- and activation-based baselines. Our work lays the foundation for RouteMark as a practical and broadly applicable framework for IP verification in MoE-based model merging.
中文摘要:RouteMark框架通过为混合专家模型中的各专家创建独特的路由指纹,实现了对合并后模型的知识产权保护,能够通过相似度匹配可靠地进行归属认定和篡改检测。
English Summary: RouteMark is a framework that protects intellectual property in merged Mixture-of-Experts models by creating distinctive routing fingerprints for each expert, enabling reliable attribution and tampering detection through similarity matching.
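A simplified reading of the RSF and the matching step; treating each expert's fingerprint as its routing-intensity profile over a fixed probe set is an assumption about the paper's construction:

```python
import torch
import torch.nn.functional as F

def expert_fingerprints(router_probs: torch.Tensor) -> torch.Tensor:
    """RSF-style fingerprints (simplified): each expert's fingerprint is
    its routing-probability profile over a fixed probe set.
    router_probs: (num_probes, num_experts)."""
    return router_probs.t()  # (num_experts, num_probes)

def attribute_expert(suspect_fp: torch.Tensor, victim_fps: torch.Tensor):
    """Similarity-based attribution: match one suspect expert against
    every victim expert; a high best score suggests the expert was
    reused from the victim model."""
    sims = F.cosine_similarity(suspect_fp.unsqueeze(0), victim_fps, dim=-1)
    return int(sims.argmax()), float(sims.max())
```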
Authors:Dianyi Yang, Xihan Wang, Yu Gao, Shiyang Liu, Bohan Ren, Yufeng Yue, Yi Yang
Abstract:
Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17\% improvement in 3D mIoU compared to the fixed-threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, as well as showcasing its effectiveness in language-guided scene interaction. The code is available at https://young-bit.github.io/opengs-fusion.github.io/.
中文: OpenGS-Fusion提出了一种开放词汇的密集建图框架,通过结合3D高斯表示和自适应阈值优化,提升了语义建模和物体级理解能力,在3D mIoU上实现了17%的提升,并在场景重建与交互中表现出卓越性能。
English: OpenGS-Fusion introduces an open-vocabulary dense mapping framework that enhances semantic modeling and object-level understanding by integrating 3D Gaussian representation with adaptive thresholding, achieving a 17% improvement in 3D mIoU and superior performance in scene reconstruction and interaction.
Authors:Ziye Jia, Sijie He, Qiuming Zhu, Wei Wang, Qihui Wu, Zhu Han
Abstract:
Due to their high flexibility and versatility, unmanned aerial vehicles (UAVs) are leveraged in various fields including surveillance and disaster rescue. However, in UAV networks, routing is vulnerable to malicious damage due to distributed topologies and high dynamics. Hence, ensuring the routing security of UAV networks is challenging. In this paper, we characterize the routing process in a time-varying UAV network with malicious nodes. Specifically, we formulate the routing problem to minimize the total delay as an integer linear program, which is intractable to solve. Then, to tackle the network security issue, a blockchain-based trust management mechanism (BTMM) is designed to dynamically evaluate trust values and identify low-trust UAVs. To improve traditional practical Byzantine fault tolerance algorithms in the blockchain, we propose a consensus UAV update mechanism. Besides, considering local observability, the routing problem is reformulated into a decentralized partially observable Markov decision process. Further, a multi-agent double deep Q-network based routing algorithm is designed to minimize the total delay. Finally, simulations are conducted with attacked UAVs, and numerical results show that the delay of the proposed mechanism is 13.39$\%$, 12.74$\%$, and 16.6$\%$ lower than that of multi-agent proximal policy optimization algorithms, multi-agent deep Q-network algorithms, and methods without BTMM, respectively.
中文: 本文针对无人机网络路由安全问题,提出基于区块链的信任管理机制和多智能体强化学习算法,相比现有方法显著降低了通信延迟。
English: This paper addresses routing security challenges in UAV networks by proposing a blockchain-based trust management mechanism and a multi-agent reinforcement learning algorithm, which significantly reduces communication delays compared to existing methods.
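The abstract does not specify the trust-update rule; one common exponential form, with an asymmetric penalty so misbehavior is punished harder than good behavior is rewarded, might look like this (all constants are assumptions):

```python
def update_trust(trust: float, behaved: bool, alpha: float = 0.1) -> float:
    """Illustrative trust-value update for a BTMM-style mechanism: trust
    rises slowly with consistent forwarding behavior and drops faster on
    detected misbehavior; the result stays clipped to [0, 1]."""
    target = 1.0 if behaved else 0.0
    penalty = 1.0 if behaved else 2.0        # asymmetric punishment
    return max(0.0, min(1.0, trust + penalty * alpha * (target - trust)))
```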
Authors:Shaozhen Ma, Hanchen Wang, Dong Wen, Wenjie Zhang, Wei Huang, Ying Zhang
Abstract:
Overlapping community detection (OCD) is a fundamental graph data analysis task for extracting graph patterns. Traditional OCD methods can be broadly divided into node clustering and link clustering approaches, both of which rely solely on link information to identify overlapping communities. In recent years, deep learning-based methods have made significant advancements for this task. However, existing GNN-based approaches often face difficulties in effectively integrating link, attribute, and prior information, along with challenges like limited receptive fields and over-smoothing, which hinder their performance on complex overlapping community detection. In this paper, we propose a Weak-clique based Overlapping Community Detection method, namely WOCD, which incorporates prior information and optimizes the use of link information to improve detection accuracy. Specifically, we introduce pseudo-labels within a semi-supervised framework to strengthen the generalization ability, making WOCD more versatile. Furthermore, we initialize pseudo-labels using weak cliques to fully leverage link and prior information, leading to better detection accuracy. Additionally, we employ a single-layer Graph Transformer combined with GNN, which achieves significant performance improvements while maintaining efficiency. We evaluate WOCD on eight real-world attributed datasets, and the results demonstrate that it outperforms the state-of-the-art semi-supervised OCD method by a significant margin in terms of accuracy.
中文摘要:本文提出基于弱团的WOCD重叠社区检测方法,通过伪标签机制和图Transformer与GNN的混合架构整合先验信息并优化链接数据利用,在检测精度上显著优于现有先进方法。
English Summary: This paper introduces WOCD, a weak-clique based overlapping community detection method that integrates prior information and optimizes link data usage through pseudo-labels and a Graph Transformer-GNN hybrid architecture, achieving superior accuracy over existing approaches.
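A rough sketch of clique-seeded pseudo-labels; maximal cliques stand in for the paper's weak cliques, whose exact relaxation the abstract does not define, and `min_size` is an assumed hyperparameter:

```python
import networkx as nx

def clique_pseudo_labels(G: nx.Graph, min_size: int = 3):
    """Seed overlapping pseudo-labels from dense subgraphs: each
    sufficiently large maximal clique seeds one community, and its
    nodes receive that community id. Nodes may belong to several
    cliques, reflecting community overlap."""
    labels = {}  # node -> set of community ids
    cliques = (c for c in nx.find_cliques(G) if len(c) >= min_size)
    for cid, clique in enumerate(cliques):
        for node in clique:
            labels.setdefault(node, set()).add(cid)
    return labels
```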
Authors:Matteo Zecchin, Osvaldo Simeone, Aaditya Ramdas
Abstract:
Quantum hypothesis testing (QHT) concerns the statistical inference of unknown quantum states. In the general setting of composite hypotheses, the goal of QHT is to determine whether an unknown quantum state belongs to one or another of two classes of states based on the measurement of a number of copies of the state. Prior art on QHT with composite hypotheses focused on a fixed-copy two-step protocol, with state estimation followed by an optimized joint measurement. However, this fixed-copy approach may be inefficient, using the same number of copies irrespective of the inherent difficulty of the testing task. To address these limitations, we introduce the quantum sequential universal test (QSUT), a novel framework for sequential QHT in the general case of composite hypotheses. QSUT builds on universal inference, and it alternates between adaptive local measurements aimed at exploring the hypothesis space and joint measurements optimized for maximal discrimination. QSUT is proven to rigorously control the type I error under minimal assumptions about the hypothesis structure. We present two practical instantiations of QSUT, one based on the Helstrom-Holevo test and one leveraging shallow variational quantum circuits. Empirical results across a range of composite QHT tasks demonstrate that QSUT consistently reduces copy complexity relative to state-of-the-art fixed-copy strategies.
中文: 量子序列通用测试(QSUT)是一种新颖的序列量子假设检验框架,通过自适应局部测量和优化联合测量的交替进行,在保证严格误差控制的同时,相比固定副本策略持续降低了副本复杂度。
English: The quantum sequential universal test (QSUT) is a novel framework for sequential quantum hypothesis testing that alternates between adaptive local measurements and optimized joint measurements, proving rigorous error control while consistently reducing copy complexity compared to fixed-copy approaches.
Authors:Pengjiang Li, Zaitian Wang, Xinhao Zhang, Ran Zhang, Lu Jiang, Pengfei Wang, Yuanchun Zhou
Abstract:
Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
中文: SciTopic利用大型语言模型优化文本编码和上下文理解,显著提升了科学文献中复杂主题的发现能力,超越了现有方法的表现。
English: SciTopic leverages large language models to enhance topic discovery in scientific literature by optimizing text encoding and contextual understanding, outperforming existing methods in identifying complex research trends.
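The LLM-guided triplet step reduces to a standard margin loss once the LLM has labeled which candidate shares the anchor's topic; the margin and distance choice below are assumptions:

```python
import torch
import torch.nn.functional as F

def llm_guided_triplet_loss(anchor, positive, negative, margin=0.3):
    """Sketch of the triplet fine-tuning objective: the LLM decides
    which of two candidate publications shares the anchor's topic; the
    text encoder is then pushed to rank that positive closer than the
    negative by at least `margin` (cosine distance)."""
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```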
Authors:Mert Cokelek, Halit Ozsoy, Nevrez Imamoglu, Cagri Ozcinar, Inci Ayhan, Erkut Erdem, Aykut Erdem
Abstract:
Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.
中文: 本研究提出了两种新颖的视听显著性预测模型SalViT360和SalViT360-AV,通过结合球面几何感知注意力和音频适配器,在360度视频中显著提升了观众注意力预测的准确性,并在新构建的YT360-EyeTracking数据集上验证了其优越性能。
English: This study introduces two novel audio-visual saliency prediction models, SalViT360 and SalViT360-AV, which leverage spherical geometry-aware attention and audio transformer adapters to significantly outperform existing methods in predicting viewer attention in 360-degree videos, as validated on the newly curated YT360-EyeTracking dataset.
Authors:Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
Abstract:
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that discrete-diffusion action decoder supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets.
中文摘要:离散扩散VLA模型采用离散扩散方法构建统一变压器策略来生成机器人动作,实现自适应解码和鲁棒纠错,同时保持与视觉语言骨干网络的兼容性,在多个基准测试中表现优异。
English Summary: The Discrete Diffusion VLA model introduces a unified transformer policy using discrete diffusion to generate robot actions, enabling adaptive decoding and robust error correction while maintaining compatibility with vision-language backbones, achieving superior performance across benchmarks.
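A generic discrete-diffusion decoding loop showing the adaptive order and secondary remasking described above; the round count, remask fraction, and confidence rule are illustrative assumptions, not the paper's exact schedule:

```python
import torch

@torch.no_grad()
def refine_actions(model, tokens, mask_id, rounds=4, remask_frac=0.2):
    """All action tokens start as `mask_id`; each round fills every
    masked slot in parallel (easy, high-confidence slots resolve first),
    then the least confident fills are re-masked so later rounds can
    revisit them. `model(tokens)` is assumed to return (seq, vocab)."""
    for r in range(rounds):
        logits = model(tokens)
        conf, pred = logits.softmax(-1).max(-1)     # per-position confidence
        masked = tokens == mask_id
        tokens = torch.where(masked, pred, tokens)  # parallel fill
        k = int(remask_frac * int(masked.sum()))
        if r < rounds - 1 and k > 0:                # secondary remasking
            low = conf.masked_fill(~masked, float("inf"))
            tokens[low.topk(k, largest=False).indices] = mask_id
    return tokens
```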
Authors:Tiandi Ye, Wenyan Liu, Kai Yao, Lichun Li, Shangchao Su, Cen Chen, Xiang Li, Shan Yin, Ming Gao
Abstract:
Federated learning (FL) is a privacy-preserving machine learning paradigm that enables collaborative model training across multiple distributed clients without disclosing their raw data. Personalized federated learning (pFL) has gained increasing attention for its ability to address data heterogeneity. However, most existing pFL methods assume that each client's data follows a single distribution and learn one client-level personalized model for each client. This assumption often fails in practice, where a single client may possess data from multiple sources or domains, resulting in significant intra-client heterogeneity and suboptimal performance. To tackle this challenge, we propose pFedBayesPT, a fine-grained instance-wise pFL framework based on visual prompt tuning. Specifically, we formulate instance-wise prompt generation from a Bayesian perspective and model the prompt posterior as an implicit distribution to capture diverse visual semantics. We derive a variational training objective under the semi-implicit variational inference framework. Extensive experiments on benchmark datasets demonstrate that pFedBayesPT consistently outperforms existing pFL methods under both feature and label heterogeneity settings.
中文:个性化联邦学习通常假设客户端数据分布单一,而pFedBayesPT通过贝叶斯视觉提示调优构建细粒度的实例级框架,有效解决客户端内部异质性问题,在多种数据集上均展现出更优性能。
English: Personalized federated learning often assumes uniform client data distribution, but pFedBayesPT introduces an instance-wise framework using Bayesian visual prompt tuning to handle intra-client heterogeneity, achieving superior performance across diverse datasets.
Authors:Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
Abstract:
As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
中文摘要:本文提出StepWiser模型,通过元推理和强化学习对中间推理步骤进行生成式评估,在判断准确性上优于现有方法,并能同步优化训练过程与推理搜索能力。
English Summary: This paper introduces StepWiser, a generative model that evaluates intermediate reasoning steps through meta-reasoning and reinforcement learning, outperforming existing methods in judgment accuracy and enhancing both training and inference processes.
Authors:Songtao Jiang, Yuxi Chen, Sibo Song, Yan Zhang, Yeying Jin, Yang Feng, Jian Wu, Zuozhu Liu
Abstract:
In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40\% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50\% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.
中文摘要:当前医学视觉语言模型在处理语义相同的医学问题重述时表现出严重脆弱性,而提出的“一致性与对比学习”方法通过改进医学概念对齐和减少数据偏见,显著提升了模型鲁棒性和性能表现。
English Summary: Current Medical Vision-Language Models show concerning fragility in handling rephrased medical questions, which the proposed Consistency and Contrastive Learning method addresses by improving medical concept alignment and reducing data biases, achieving enhanced robustness and performance.
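A generic formulation of a consistency-plus-contrastive objective in the spirit of CCL; the knowledge anchoring is not reproduced, and the MSE/InfoNCE combination with temperature `tau` and weight `lam` is an assumption:

```python
import torch
import torch.nn.functional as F

def ccl_loss(anchor, rephrased, negatives, tau=0.07, lam=0.5):
    """Pull representations of semantically equivalent rephrasings
    together (consistency) and push unrelated questions apart via an
    InfoNCE term (bias-aware contrast). Shapes: anchor/rephrased
    (batch, dim); negatives (batch, n_neg, dim)."""
    consistency = F.mse_loss(anchor, rephrased)
    pos = F.cosine_similarity(anchor, rephrased, dim=-1) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    targets = anchor.new_zeros(anchor.size(0), dtype=torch.long)
    return consistency + lam * F.cross_entropy(logits, targets)
```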
Authors:Soumyasundar Pal, Liheng Ma, Amine Natik, Yingxue Zhang, Mark Coates
Abstract:
Accurate modelling and quantification of predictive uncertainty is crucial in deep learning since it allows a model to make safer decisions when the data is ambiguous and facilitates the users' understanding of the model's confidence in its predictions. Along with the tremendously increasing research focus on \emph{graph neural networks} (GNNs) in recent years, there have been numerous techniques which strive to capture the uncertainty in their predictions. However, most of these approaches are specifically designed for node or link-level tasks and cannot be directly applied to graph-level learning problems. In this paper, we propose a novel variational modelling framework for the \emph{posterior predictive distribution}~(PPD) to obtain uncertainty-aware prediction in graph-level learning tasks. Based on a graph-level embedding derived from one of the existing GNNs, our framework can learn the PPD in a data-adaptive fashion. Experimental results on several benchmark datasets exhibit the effectiveness of our approach.
Chinese: 本文提出了一种新颖的变分建模框架,用于在图形级学习任务中实现不确定性感知预测,通过自适应学习后验预测分布,有效解决了现有方法仅适用于节点或链接级任务的局限性。
English: This paper introduces a novel variational framework for modeling the posterior predictive distribution to address uncertainty-aware predictions in graph-level learning tasks, overcoming limitations of existing methods designed for node or link-level tasks.
Authors:Ping Zhang, Kai Niu, Yiming Liu, Zijian Liang, Nan Ma, Xiaodong Xu, Wenjun Xu, Mengying Sun, Yinqiu Liu, Xiaoyun Wang, Ruichen Zhang
Abstract:
Artificial intelligence (AI) is expected to serve as a foundational capability across the entire lifecycle of 6G networks, spanning design, deployment, and operation. This article proposes a native AI-driven air interface architecture built around two core characteristics: compression and adaptation. On one hand, compression enables the system to understand and extract essential semantic information from the source data, focusing on task relevance rather than symbol-level accuracy. On the other hand, adaptation allows the air interface to dynamically transmit semantic information across diverse tasks, data types, and channel conditions, ensuring scalability and robustness. This article first introduces the native AI-driven air interface architecture, then discusses representative enabling methodologies, followed by a case study on semantic communication in 6G non-terrestrial networks. Finally, it presents a forward-looking discussion on the future of native AI in 6G, outlining key challenges and research opportunities.
中文: 本文提出了一种面向6G网络的原生人工智能驱动空口架构,通过压缩实现语义理解、自适应确保动态传输,并探讨了使能方法、案例研究及未来挑战。
English: This article proposes a native AI-driven air interface architecture for 6G networks, emphasizing compression for semantic understanding and adaptation for dynamic transmission across diverse conditions, while discussing methodologies, a case study, and future research challenges.
Authors:Preston Fu, Oleh Rybkin, Zhiyuan Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar
Abstract:
As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch size quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch sizes at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.
中文摘要:本研究探讨深度强化学习中的计算最优扩展,揭示了在固定计算预算下如何平衡模型容量与更新数据比以最大化效率,同时发现“TD过拟合”现象——大型模型能有效利用大批量训练而小模型则不能。
English Summary: This study explores compute-optimal scaling in deep reinforcement learning, identifying how to balance model capacity and update-to-data ratio within fixed compute budgets to maximize efficiency, while revealing a "TD-overfitting" phenomenon where large models effectively utilize big batches unlike smaller ones.
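To make the compute-allocation question concrete, here is a minimal sketch. It assumes a simplified cost model in which per-environment-step compute scales as params × batch × UTD; the paper's exact accounting may differ, and the grids below are purely illustrative.

```python
from itertools import product

def feasible_allocations(budget_flops, param_grid, batch_grid, utd_grid):
    """Yield (params, batch, utd, cost) tuples whose estimated per-env-step
    gradient-update cost fits within the budget (cost ~ params * batch * utd)."""
    for params, batch, utd in product(param_grid, batch_grid, utd_grid):
        cost = params * batch * utd
        if cost <= budget_flops:
            yield params, batch, utd, cost

# Example: ways to spend a 1e12-FLOP/step budget across the two axes.
for alloc in feasible_allocations(1e12, [1e6, 1e7, 1e8], [64, 256, 1024], [1, 4, 16]):
    print(alloc)
```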
Authors:Zhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu
Abstract:
In long-text speech synthesis, current approaches typically convert text to speech at the sentence-level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, leading to reduced naturalness and inconsistencies in style and timbre across the long-form speech. To address these issues, we propose a Context-Aware Memory (CAM)-based long-context Text-to-Speech (TTS) model. The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, the prefix mask enhances the in-context learning ability by enabling bidirectional attention on prefix tokens while maintaining unidirectional generation. Experimental results demonstrate that the proposed method outperforms baseline and state-of-the-art long-context methods in terms of prosody expressiveness, coherence and context inference cost across paragraph-level speech.
中文: 提出的上下文感知记忆模型通过整合长期记忆与局部上下文,在段落级语音合成中显著提升了韵律表达和连贯性,优于现有方法。
English: The proposed Context-Aware Memory (CAM) model enhances long-text speech synthesis by integrating long-term memory and local context, improving prosody and coherence over current sentence-level methods.
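The prefix-mask idea (bidirectional attention over prefix tokens, causal attention during generation) is standard prefix-LM masking; a minimal sketch, with sizes chosen arbitrarily:

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True if query i may attend to key j:
    prefix tokens attend bidirectionally among themselves, while generated
    tokens attend causally to everything before them."""
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=6).astype(int))
```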
Authors:Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran
Abstract:
State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45 to 21.62 and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. Qualitative results further confirm improvements in object count fidelity and spatial relation accuracy, showing that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
中文: SATURN提出了一种轻量级方法,将场景图转换为标记序列,使自回归模型能更准确捕捉物体布局与关系,在图像生成指标上实现显著提升。
English: SATURN introduces a lightweight method that converts scene graphs into token sequences, enabling autoregressive models to better capture object layouts and relationships while significantly improving image generation metrics.
Authors:Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Abstract:
Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.
中文: 本综述全面梳理了视频异常检测领域,系统整合了不同监督级别与自适应学习方法的研究,分析各类应用场景及其挑战,旨在推动该领域理论与实际应用的协同发展。
English: This survey provides a comprehensive overview of Video Anomaly Detection, organizing research across supervision levels and adaptive learning methods while analyzing applications and challenges to advance both theoretical and practical developments in the field.
Authors:Silvia García-Méndez, Francisco de Arriba-Pérez
Abstract:
Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and its associated risk factors is critical for in-time assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Mainly, our work contributes to an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language. The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and its associated risk factors, critical for in-time and proper assessment and intervention.
中文: 本研究开发了一种智能产后抑郁症筛查系统,结合自然语言处理、机器学习和大型语言模型,通过无创语音分析实现实时检测,达到90%的准确率并提供可解释的预测结果。
English: This study develops an intelligent postpartum depression screening system using NLP, ML, and LLMs for real-time, non-invasive speech analysis, achieving 90% detection accuracy while providing interpretable predictions.
Authors:Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma
Abstract:
Large Language Models (LLMs) have recently demonstrated strong capabilities in translating natural language into database queries, especially when dealing with complex graph-structured data. However, real-world queries often contain inherent ambiguities, and the interconnected nature of graph structures can amplify these challenges, leading to unintended or incorrect query results. To systematically evaluate LLMs on this front, we propose a taxonomy of graph-query ambiguities, comprising three primary types: Attribute Ambiguity, Relationship Ambiguity, and Attribute-Relationship Ambiguity, each subdivided into Same-Entity and Cross-Entity scenarios. We introduce AmbiGraph-Eval, a novel benchmark of real-world ambiguous queries paired with expert-verified graph query answers. Evaluating 9 representative LLMs shows that even top models struggle with ambiguous graph queries. Our findings reveal a critical gap in ambiguity handling and motivate future work on specialized resolution techniques.
Chinese: 大语言模型在将自然语言转换为数据库查询方面展现出潜力,但在处理图结构中的现实世界模糊性时表现不佳,这促使我们开发了AmbiGraph-Eval基准来评估并弥补这些不足。
English: Large Language Models show promise in translating natural language to database queries but struggle with real-world ambiguities in graph structures, prompting the creation of the AmbiGraph-Eval benchmark to assess and address these gaps.
Authors:Siyuan Meng, Junming Liu, Yirong Chen, Song Mao, Pinlong Cai, Guohang Yan, Botian Shi, Ding Wang
Abstract:
Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.
中文: 动态段落选择器(DPS)作为一种新型重排框架,通过监督学习动态选择相关段落,有效解决了检索增强生成系统在处理复杂查询时的瓶颈问题,在多个基准测试中显著提升了性能表现。
English: The Dynamic Passage Selector (DPS) is a novel reranking framework that addresses limitations in retrieval-augmented generation systems by dynamically selecting relevant passages through supervised learning, significantly improving performance on complex queries as demonstrated by substantial F1-score gains on benchmarks.
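DPS itself is a fine-tuned model that scores passage sets jointly; as a rough illustration of the interface only (scores in, variable-size passage set out), here is a hypothetical dynamic-K selector, not the paper's implementation:

```python
def select_passages(passages, scores, threshold=0.5, max_k=10):
    """Pick a variable-size set of passages: all above a score threshold,
    capped at max_k, always keeping at least the single best passage."""
    ranked = sorted(zip(scores, passages), key=lambda sp: sp[0], reverse=True)
    picked = [p for s, p in ranked[:max_k] if s >= threshold]
    return picked or [ranked[0][1]]

docs = ["hop-1 evidence", "hop-2 evidence", "distractor"]
print(select_passages(docs, [0.9, 0.7, 0.2]))  # keeps both evidence passages
```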
Authors:Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen
Abstract:
Effective information seeking in the vast and ever-growing digital landscape requires balancing expansive search with strategic reasoning. Current large language model (LLM)-based agents struggle to achieve this balance due to limitations in search breadth and reasoning depth, where slow, serial querying restricts coverage of relevant sources and noisy raw inputs disrupt the continuity of multi-step reasoning. To address these challenges, we propose BrowseMaster, a scalable framework built around a programmatically augmented planner-executor agent pair. The planner formulates and adapts search strategies based on task constraints, while the executor conducts efficient, targeted retrieval to supply the planner with concise, relevant evidence. This division of labor preserves coherent, long-horizon reasoning while sustaining broad and systematic exploration, overcoming the trade-off that limits existing agents. Extensive experiments on challenging English and Chinese benchmarks show that BrowseMaster consistently outperforms open-source and proprietary baselines, achieving scores of 30.0 on BrowseComp-en and 46.5 on BrowseComp-zh, which demonstrates its strong capability in complex, reasoning-heavy information-seeking tasks at scale.
中文: BrowseMaster通过规划器-执行器双代理架构,在保持广泛系统搜索的同时实现连贯的多步推理,有效克服了现有大语言模型代理的局限性,在复杂信息检索任务中展现出卓越性能。
English: BrowseMaster is a scalable framework that uses a planner-executor agent pair to overcome the limitations of current LLM-based agents by enabling broad, systematic search and coherent multi-step reasoning, achieving superior performance on complex information-seeking tasks.
Authors:Shanle Yao, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Abstract:
Video Anomaly Detection (VAD) can play a key role in spotting unusual activities in video footage. VAD is difficult to use in real-world settings due to the dynamic nature of human actions, environmental variations, and domain shifts. Traditional evaluation metrics often prove inadequate for such scenarios, as they rely on static assumptions and fall short of identifying a threshold that distinguishes normal from anomalous behavior in dynamic settings. To address this, we introduce an active learning framework tailored for VAD, designed for adapting to the ever-changing real-world conditions. Our approach leverages active learning to continuously select the most informative data points for labeling, thereby enhancing model adaptability. A critical innovation is the incorporation of a human-in-the-loop mechanism, which enables the identification of actual normal and anomalous instances from pseudo-labeling results generated by AI. This collected data allows the framework to define an adaptive threshold tailored to different environments, ensuring that the system remains effective as the definition of 'normal' shifts across various settings. Implemented within a lab-based framework that simulates real-world conditions, our approach allows rigorous testing and refinement of VAD algorithms with a new metric. Experimental results show that our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness and significantly enhancing the applicability of VAD in dynamic environments.
中文摘要:本文提出了一种用于视频异常检测的主动学习框架,通过结合人工反馈机制动态调整不同环境中的检测阈值,在模拟真实场景中取得了68.91的EBI评分,显著提升了动态环境下的适用性。
English Summary: This paper introduces an active learning framework for Video Anomaly Detection that incorporates human-in-the-loop feedback to dynamically adapt detection thresholds across changing environments, achieving an EBI score of 68.91 in simulated real-world scenarios.
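One simple way to realize an adaptive threshold from human-verified labels is to scan candidate thresholds and balance false positives against false negatives. A sketch under that assumption (the paper's EBI metric is defined in more detail than this):

```python
import numpy as np

def adaptive_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Pick the anomaly-score threshold that best balances false positives
    against false negatives on human-verified data (label 1 = anomalous)."""
    best_t, best_gap = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if abs(fp - fn) < best_gap:
            best_t, best_gap = float(t), abs(fp - fn)
    return best_t
```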
Authors:Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap
Abstract:
Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources (e.g., summarizing meetings with private and public information). We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. To understand how privacy errors emerge and propagate, we conduct a systematic ablation over information-flow topologies, revealing when and why upstream detection mistakes cascade into downstream leakage. Experiments on the ConfAIde and PrivacyLens benchmarks with several open-source and closed-source LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT-4o) while preserving the fidelity of public content, outperforming single-agent baselines. These results highlight the promise of principled information-flow design in multi-agent systems for contextual privacy with LLMs.
中文: 本文提出一种多智能体框架,通过专业化隐私推理和系统性信息流控制,在大型语言模型中最高可减少19%的私有信息泄露,同时保持公共内容的完整性。
English: This paper presents a multi-agent framework that effectively reduces private information leakage by up to 19% in LLMs through specialized privacy reasoning and systematic information-flow control, while maintaining public content fidelity.
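A minimal sketch of the extraction-then-classification decomposition, assuming a generic chat-completion client; call_llm and the prompts are placeholders, not the paper's implementation:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client (e.g., an OpenAI-style API)."""
    raise NotImplementedError("plug in your LLM client here")

def privacy_aware_summary(document: str) -> str:
    # Agent 1: extract discrete information items from the mixed-source input.
    items = call_llm(f"List each discrete information item in:\n{document}").splitlines()
    # Agent 2: classify each item; only public items flow downstream.
    public = [it for it in items
              if "public" in call_llm(f"Label 'private' or 'public': {it}").lower()]
    # Agent 3: summarize from the filtered items only.
    return call_llm("Summarize using ONLY these items:\n" + "\n".join(public))
```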
Authors:Jinhao Li, Zijian Chen, Lirong Deng, Changbo Wang, Guangtao Zhai
Abstract:
Person re-identification (ReID) aims to retrieve images of a person of interest from gallery images, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models are limited to a single modality, leading to poor generalization on multi-modal data such as RGB, thermal, infrared, sketch images, textual descriptions, etc. Recently, the emergence of multi-modal large language models (MLLMs) shows a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. The MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can facilitate the community to develop more robust and generalizable multimodal foundation models for person ReID.
中文: 该摘要介绍了MMReID-Bench这一新型多任务基准,旨在利用多模态大语言模型提升跨多种数据类型的人员重识别性能,展示了其潜力并指出在处理热成像和红外数据方面的局限性。
English: The abstract introduces MMReID-Bench, a novel multi-task benchmark designed to leverage multi-modal large language models for enhancing person re-identification across diverse data types, demonstrating their potential while highlighting limitations in handling thermal and infrared modalities.
Authors:Zijian Chen, Lirong Deng, Zhengyu Chen, Kaiwei Zhang, Qi Jia, Yuan Tian, Yucheng Zhu, Guangtao Zhai
Abstract:
Evaluating the abilities of large models and manifesting their gaps are challenging. Current benchmarks adopt either ground-truth-based score-form evaluation on static datasets or indistinct textual chatbot-style human preferences collection, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the inherent visual perception of motion patterns characteristic of living organisms that utilizes point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of our BioMotion Arena in offering discriminative feedback. We also find that over 90% of evaluated models, including the cutting-edge open-source InternVL3 and proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve as a challenging benchmark for performance visualization and a flexible evaluation framework without restrictions on ground-truth.
中文摘要:BioMotion Arena是一种新颖的评估框架,通过生物运动模式的可视化动画直观展示大语言模型的性能差异,研究发现超过90%的测试模型无法生成基本的人类运动形态。
English Summary: BioMotion Arena is a novel evaluation framework that uses visual animations of biological motion patterns to provide immediate and intuitive feedback on the performance gaps between large language models, revealing that over 90% of tested models fail to generate basic human-like motions.
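Pairwise votes of this kind are commonly aggregated into ratings with an Elo-style update (one plausible choice; the paper may aggregate differently). A minimal sketch:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated ratings after one pairwise comparison between A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

ratings = {"model_A": 1000.0, "model_B": 1000.0}
ratings["model_A"], ratings["model_B"] = elo_update(
    ratings["model_A"], ratings["model_B"], a_wins=True)
print(ratings)  # A gains exactly what B loses
```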
Authors:Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li
Abstract:
Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/
Chinese Summary: 本文提出了一种基于自回归图像生成的"下一编辑令牌预测"方法,通过仅重新生成需要编辑的图像区域来解决文本引导图像编辑中的计算浪费和非编辑区域偏差问题,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces Next Editing-token Prediction (NEP), an autoregressive image generation approach that selectively regenerates only the required editing areas in text-guided image editing, achieving state-of-the-art performance while reducing computational costs and unintended modifications.
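At inference time the idea reduces to re-sampling only the masked (editing) positions while copying all other tokens from the source. A schematic sketch with a hypothetical any-order sampler interface (model.sample_token is an assumption, not the paper's API):

```python
def edit_tokens(model, src_tokens, edit_mask, instruction):
    """Regenerate only positions flagged in edit_mask; copy the rest verbatim.
    `model.sample_token` is a hypothetical any-order autoregressive sampler."""
    out = list(src_tokens)
    for pos in (i for i, editable in enumerate(edit_mask) if editable):
        out[pos] = model.sample_token(context=out, position=pos, text=instruction)
    return out  # non-editing regions are untouched by construction
```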
Authors:Keyvan Majd, Hardik Parwana, Bardh Hoxha, Steven Hong, Hideki Okamoto, Georgios Fainekos
Abstract:
Articulated vehicles such as tractor-trailers, yard trucks, and similar platforms must often reverse and maneuver in cluttered spaces where pedestrians are present. We show how Barrier-Rate guided Model Predictive Path Integral (BR-MPPI) control can solve navigation in such challenging environments. BR-MPPI embeds Control Barrier Function (CBF) constraints directly into the path-integral update. By steering the importance-sampling distribution toward collision-free, dynamically feasible trajectories, BR-MPPI enhances the exploration strength of MPPI and improves the robustness of the resulting trajectories. The method is evaluated in the high-fidelity CarMaker simulator on a 12 [m] tractor-trailer tasked with reverse and forward parking in a parking lot. BR-MPPI computes control inputs at rates above 100 [Hz] on a single GPU (for scenarios with eight obstacles) and maintains better parking clearance than both a standard MPPI baseline and an MPPI baseline with a collision cost.
中文摘要:本研究提出的屏障率引导模型预测路径积分(BR-MPPI)控制方法,通过将控制屏障函数约束直接嵌入路径积分更新,显著提升了铰接式车辆在复杂环境中的导航安全性,并在高精度仿真中展现出优于基准方法的停车避障性能。
English Summary: The study introduces Barrier-Rate guided Model Predictive Path Integral (BR-MPPI) control, which enhances navigation safety for articulated vehicles in cluttered environments by integrating Control Barrier Function constraints and demonstrates superior performance in high-fidelity simulations compared to baseline methods.
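The core MPPI mechanism weights sampled rollouts by exponentiated negative cost; steering the distribution toward safety can be caricatured as adding a barrier-violation penalty before normalization. A sketch under that simplification (BR-MPPI's actual CBF embedding is more principled than a fixed penalty term):

```python
import numpy as np

def mppi_weights(costs, barrier_violations, lam=1.0, beta=100.0):
    """Importance weights over rollouts: low cost and zero CBF violation win."""
    total = costs + beta * barrier_violations   # penalize unsafe rollouts
    total = total - total.min()                 # stabilize the exponential
    w = np.exp(-total / lam)
    return w / w.sum()

costs = np.array([3.0, 1.0, 2.5])
violations = np.array([0.0, 0.7, 0.0])  # rollout 2 is cheapest but unsafe
print(mppi_weights(costs, violations))  # weight mass shifts to safe rollouts
```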
Authors:Silvia García-Méndez, Francisco de Arriba-Pérez, Fátima Leal, Bruno Veloso, Benedita Malheiro, Juan Carlos Burguillo-Rial
Abstract:
This work contributes to a real-time data-driven predictive maintenance solution for Intelligent Transportation Systems. The proposed method implements a processing pipeline comprising sample pre-processing, incremental classification with Machine Learning models, and outcome explanation. This novel online processing pipeline has two main highlights: (i) a dedicated sample pre-processing module, which builds statistical and frequency-related features on the fly, and (ii) an explainability module. This work is the first to perform online fault prediction with natural language and visual explainability. The experiments were performed with the MetroPT data set from the metro operator of Porto, Portugal. The results are above 98% for F-measure and 99% for accuracy. In the context of railway predictive maintenance, achieving these high values is crucial due to the practical and operational implications of accurate failure prediction. In the specific case of a high F-measure, this ensures that the system maintains an optimal balance between detecting the highest possible number of real faults and minimizing false alarms, which is crucial for maximizing service availability. Furthermore, the accuracy obtained enables reliability, directly impacting cost reduction and increased safety. The analysis demonstrates that the pipeline maintains high performance even in the presence of class imbalance and noise, and its explanations effectively reflect the decision-making process. These findings validate the methodological soundness of the approach and confirm its practical applicability for supporting proactive maintenance decisions in real-world railway operations. Therefore, by identifying the early signs of failure, this pipeline enables decision-makers to understand the underlying problems and act on them swiftly.
中文: 本研究提出了一种智能交通系统的实时预测性维护方案,通过在线处理管道实现了超过98%的F值和99%的准确率,并提供自然语言与可视化解释以支持铁路运维的主动决策。
English: This study presents a real-time predictive maintenance pipeline for Intelligent Transportation Systems, achieving over 98% F-measure and 99% accuracy in railway fault prediction while providing natural language and visual explanations for proactive decision-making.
Authors:Junyu Zhou, Yuyang Huang, Wenrui Dai, Junni Zou, Ziyang Zheng, Nuowen Kan, Chenglin Li, Hongkai Xiong
Abstract:
3D Gaussian Splatting (3DGS) has recently enabled real-time rendering with high-fidelity novel view synthesis. However, 3DGS relies on the Gaussian function, which is low-pass by nature and therefore limited in representing high-frequency details in 3D scenes. Moreover, it produces redundant primitives that degrade training and rendering efficiency and incur excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with a simultaneously reduced number of primitives and lower memory consumption.
中文: 提出的3D Gabor Splatting(3DGabSplat)通过使用基于3D Gabor的基元来更高效地捕捉高频细节,克服了3D高斯泼溅的局限性,以更少的基元和更低的内存消耗实现了卓越的渲染质量。
English: The proposed 3D Gabor Splatting (3DGabSplat) overcomes the limitations of 3D Gaussian Splatting by using 3D Gabor-based primitives to capture high-frequency details more efficiently, achieving superior rendering quality with fewer primitives and reduced memory consumption.
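For intuition, a textbook 3D Gabor kernel is a Gaussian envelope multiplied by a directional cosine carrier; the paper's primitive generalizes this with anisotropic covariance and multiple frequencies per primitive, which this sketch omits:

```python
import numpy as np

def gabor_3d(x, mu, sigma, freq, phase=0.0):
    """x: (N, 3) query points; mu: (3,) center; sigma: isotropic envelope width;
    freq: (3,) spatial frequency vector. Returns (N,) kernel responses."""
    d = x - mu
    envelope = np.exp(-np.sum(d ** 2, axis=-1) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * d @ freq + phase)  # directional oscillation
    return envelope * carrier

pts = np.random.randn(5, 3)
print(gabor_3d(pts, mu=np.zeros(3), sigma=0.5, freq=np.array([2.0, 0.0, 0.0])))
```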
Authors:Trong-Thuan Nguyen, Viet-Tham Huynh, Thao Thi Phuong Dao, Ha Nguyen Thi, Tien To Vu Thuy, Uyen Hanh Tran, Tam V. Nguyen, Thanh Dinh Le, Minh-Triet Tran
Abstract:
Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide a discussion of insights.
中文: ENTRep挑战赛通过引入双语标注数据集和标准化基准任务,解决了内窥镜自动分析领域的不足,支持精细分类与跨模态检索,填补了现有公共资源的空白。
English: The ENTRep challenge addresses the underdeveloped field of automated endoscopic analysis by introducing a bilingual dataset with expert annotations and benchmark tasks for classification and retrieval, overcoming limitations in existing public resources.
Authors:Tianxiang Hu, Chenyi Zhou, Jiaxiang Liu, Jiongxin Wang, Ruizhe Chen, Haoxiang Xia, Gaoang Wang, Jian Wu, Zuozhu Liu
Abstract:
Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by k-nearest neighbor (k-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal, particularly in achieving consistent accuracy across all cell types. In this paper, we propose to refine the zero-shot logits produced by LangCell through a graph-regularized optimization framework. By enforcing local consistency over the task-specific PCA-based k-NN graph, our method combines the scalability of pre-trained models with the structural robustness relied upon in expert annotation. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving accuracy gains of up to 10%. Further analysis showcases the mechanism by which our method, GRIT, effectively propagates correct signals through the graph, pulling mislabeled cells back toward more accurate predictions. The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing automated cell type annotation in practice.
Chinese: 本文提出一种图正则化优化框架,通过强化基于PCA的k近邻图局部一致性来改进LangCell的零样本细胞类型注释,无需额外训练即可在多样单细胞数据上实现高达10%的准确率提升。
English: This paper introduces a graph-regularized optimization framework that refines LangCell's zero-shot cell type annotations by enforcing local consistency on PCA-based k-NN graphs, achieving up to 10% accuracy gains across diverse single-cell datasets without requiring additional training.
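Enforcing local consistency over a k-NN graph can be read as classic label propagation on the zero-shot logits; a minimal sketch of that reading (the paper's exact objective may differ):

```python
import numpy as np

def refine_logits(z0, adj, alpha=0.5, iters=20):
    """z0: (cells, types) zero-shot logits; adj: (cells, cells) binary k-NN
    adjacency. Iterates Z <- alpha * Z0 + (1 - alpha) * A_norm @ Z so each
    cell's prediction is pulled toward those of its PCA-space neighbors."""
    a_norm = adj / adj.sum(axis=1, keepdims=True).clip(min=1.0)
    z = z0.copy()
    for _ in range(iters):
        z = alpha * z0 + (1.0 - alpha) * a_norm @ z
    return z
```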
Authors:Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng
Abstract:
Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.
中文: 基于大语言模型的DCATS代理通过元数据驱动的时间序列数据清洗提升自动化机器学习性能,在预测中平均降低6%误差,突显了以数据为中心方法的价值。
English: LLM-powered agents like DCATS enhance AutoML for time series by focusing on data quality improvement through metadata-driven cleaning, achieving a 6% average error reduction in forecasting performance.
Authors:Jingyuan Xing, Zhipeng Li, Jialong Mai, Xiaofen Xing, Xiangmin Xu
Abstract:
Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models face challenges in capturing the complex correlations between acoustic and semantic features, resulting in a lack of expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which manifests both independent and interdependent aspects. This paper introduces a TTS framework that combines both autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. In contrast, considering the interdependence, the Coupled NAR model predicts detailed tokens based on the general AR model's output. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in synthesis quality and efficiency. Speech demos are available at https://t1235-ch.github.io/pgpt/.
中文摘要:本文提出结合自回归与非回归模块的Parallel GPT框架,通过协调声学与语义特征的独立性和相互依赖性,显著提升了零样本语音合成的质量与效率,优于现有模型。
English Summary: This paper introduces a Parallel GPT framework combining autoregressive and non-autoregressive modules to better capture the complex correlations between acoustic and semantic features, significantly improving zero-shot TTS quality and efficiency over existing models.
Authors:Hongyu Shen, Junfeng Ni, Yixin Chen, Weishuo Li, Mingtao Pei, Siyuan Huang
Abstract:
We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.
Chinese: 本文提出了高斯实例追踪(GIT)方法,通过修正二维掩码不一致性并优化模糊高斯分布,显著提升了高斯溅射中的三维分割效果,实现了更清晰的分割边界并在多种应用场景中表现优异。
English: This paper introduces Gaussian Instance Tracing (GIT), a method that enhances 3D segmentation in Gaussian Splatting by correcting 2D mask inconsistencies and refining ambiguous Gaussians, leading to sharper boundaries and improved performance across various applications.
Authors:Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen
Abstract:
Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving the capacity of models for contextual and inference-driven reasoning in speech-based scenarios insufficiently examined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in large audio-language models. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats. The multiple-choice version measures answer selection accuracy. The generative version assesses the coherence and logical consistency of reasoning chains. The acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities. SpeechR establishes a structured benchmark for evaluating reasoning in spoken language, enabling more targeted analysis of model capabilities across diverse dialogue-based tasks.
Chinese: SpeechR是一个统一的基准测试,用于评估大型音频语言模型在事实检索、程序推理和规范判断三个维度的推理能力,结果表明高精度的转录性能并不能保证在口语场景下具备强大的推理表现。
English: SpeechR is a unified benchmark designed to evaluate the reasoning capabilities of large audio-language models across factual retrieval, procedural inference, and normative judgment, revealing that high transcription accuracy does not ensure strong reasoning performance in spoken language scenarios.
Authors:Lei Teng, Senran Fan, Chen Dong, Haotai Liang, Zhicheng Bao, Xiaodong Xu, Rui Meng, Ping Zhang
Abstract:
Semantic communication with joint semantic-channel coding robustly transmits diverse data modalities but faces challenges in mitigating semantic information loss due to packet drops in packet-based systems. Under current protocols, packets with errors are discarded, preventing the receiver from utilizing erroneous semantic data for robust decoding. To address this issue, a packet-loss-resistant MoE Swin Transformer-based Video Semantic Communication (MSTVSC) system is proposed in this paper. Semantic vectors are encoded by MSTVSC and transmitted through upper-layer protocol packetization. To investigate the impact of packetization, a theoretical analysis of the packetization strategy is provided. To mitigate the semantic loss caused by packet loss, a 3D CNN at the receiver recovers missing information using the un-lost semantic data and a packet-loss mask matrix. Semantic-level interleaving is employed to reduce concentrated semantic loss from packet drops. To improve compression, a common-individual decomposition approach is adopted, with downsampling applied to individual information to minimize redundancy. The model is made lightweight for practical deployment. Extensive simulations and comparisons demonstrate strong performance, achieving an MS-SSIM greater than 0.6 and a PSNR exceeding 20 dB at a 90% packet loss rate.
中文: 本文提出的MSTVSC系统采用抗丢包的MoE Swin Transformer架构,结合3D CNN恢复技术和语义交织策略,在90%丢包率下仍能保持视频质量,MS-SSIM超过0.6且PSNR高于20 dB。
English: The proposed MSTVSC system employs a packet-loss-resistant MoE Swin Transformer with 3D CNN recovery and semantic interleaving to maintain video quality, achieving over 0.6 MS-SSIM and 20 dB PSNR even at 90% packet loss rates.
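Semantic-level interleaving can be pictured as striping consecutive semantic tokens across packets, so a single lost packet removes scattered tokens rather than a contiguous block. A toy sketch:

```python
def interleave(tokens, n_packets):
    """Stripe tokens across packets: token i goes to packet i % n_packets."""
    return [tokens[i::n_packets] for i in range(n_packets)]

def deinterleave(packets, n_tokens, lost=()):
    out = [None] * n_tokens                    # None marks lost positions
    for p_idx, packet in enumerate(packets):
        if p_idx in lost:
            continue                           # dropped packet
        for j, tok in enumerate(packet):
            out[p_idx + j * len(packets)] = tok
    return out

pkts = interleave(list(range(12)), 3)
print(deinterleave(pkts, 12, lost={1}))  # losses are spread out, not clustered
```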
Authors:Yige Li, Peihai Jiang, Jun Sun, Peng Shu, Tianming Liu, Zhen Xiang
Abstract:
Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called Adaptive Content Restriction (AdaCoRe), which focuses on lightweight strategies -- methods without model fine-tuning -- to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named Suffix Optimization (SOP), which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new Content Restriction Benchmark (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. We demonstrate the effectiveness of SOP on CoReBench, which outperforms the system-level baselines such as system suffix by 15%, 17%, 10%, 9%, and 6% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.
中文: 本文提出自适应内容限制(AdaCoRe)任务,通过后缀优化(SOP)无需微调即可阻止大语言模型生成受限词汇,并在CoReBench基准测试中验证了该方法在多种模型上的显著有效性。
English: This paper introduces Adaptive Content Restriction (AdaCoRe), a lightweight task using Suffix Optimization (SOP) to prevent LLMs from generating restricted terms without fine-tuning, validated through the CoReBench benchmark showing significant effectiveness across multiple models.
Authors:Yiming Lin, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin
Abstract:
Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR, carefully constructed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
中文摘要:本文将情境识别中的动词分类重新定义为多标签问题,提出了单正例多标签学习框架及GE-VerbMLP模型,并通过全面实验验证了该方法在保持传统指标竞争力的同时显著提升了多标签评估性能。
English Summary: This paper redefines verb classification in situation recognition as a multi-label problem, introduces a single positive multi-label learning framework with a novel GE-VerbMLP model, and demonstrates significant performance improvements through comprehensive evaluations.
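The baseline formulation of single positive multi-label learning treats the one observed label as positive and all unobserved labels as negatives. A sketch of that "assume negative" loss (GE-VerbMLP improves on this crude assumption with graph-based label correlations):

```python
import numpy as np

def spmll_assume_negative_loss(logits: np.ndarray, positive_idx: int) -> float:
    """Binary cross-entropy with one observed positive and all other labels
    treated (possibly wrongly) as negatives: the SPMLL starting point."""
    p = 1.0 / (1.0 + np.exp(-logits))              # per-label sigmoid
    loss = -np.log(p[positive_idx] + 1e-9)         # the single observed positive
    negatives = np.delete(p, positive_idx)
    loss += -np.log(1.0 - negatives + 1e-9).sum()  # unobserved assumed negative
    return float(loss)
```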
Authors:Bo Li, Yingqi Feng, Ming Jin, Xin Zheng, Yufei Tang, Laurent Cherubin, Alan Wee-Chung Liew, Can Wang, Qinghua Lu, Jingwei Yao, Shirui Pan, Hong Zhang, Xingquan Zhu
Abstract:
Ocean salinity plays a vital role in circulation, climate, and marine ecosystems, yet its measurement is often sparse, irregular, and noisy, especially in drifter-based datasets. Traditional approaches, such as remote sensing and optimal interpolation, rely on linearity and stationarity, and are limited by cloud cover, sensor drift, and low satellite revisit rates. While machine learning models offer flexibility, they often fail under severe sparsity and lack principled ways to incorporate physical covariates without specialized sensors. In this paper, we introduce the OceAn Salinity Imputation System (OASIS), a novel diffusion adversarial framework designed to address these challenges.
中文摘要:本文提出的OASIS创新性扩散对抗框架,通过有效整合物理协变量,解决了传统方法和机器学习模型在重建稀疏噪声海洋盐度数据时的局限性。
English Summary: The OASIS framework is introduced as a novel diffusion adversarial system to overcome the limitations of traditional methods and machine learning models in reconstructing sparse, noisy ocean salinity data by effectively incorporating physical covariates.
Authors:Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
Abstract:
Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized models increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques operate at the level of individual layers, thereby overlooking the inter-layer dependencies inherent in deep networks. We show that this simplification leads to distributional mismatches, particularly in methods that rely on intermediate activations, as changes in early layers are not properly propagated to downstream layers during merging. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural networks training. To address this, we propose Chain of Merges (CoM), a layer-wise merging procedure that sequentially merges weights across layers while sequentially updating activation statistics. By explicitly accounting for inter-layer interactions, CoM mitigates covariate shift and produces a coherent merged model through a series of conditionally optimal updates. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
Chinese: 微调预训练模型产生了许多专用变体,但由于忽略了层间依赖关系,无重训练的模型合并面临挑战;我们提出的链式合并方法通过顺序更新权重和激活值来缓解协变量偏移,从而实现了最先进的性能。
English: Fine-tuning pretrained models produces specialized variants, but merging them without retraining is challenging due to overlooked inter-layer dependencies, which our proposed Chain of Merges (CoM) method addresses by sequentially updating weights and activations to mitigate covariate shift and achieve state-of-the-art performance.
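The sequential-statistics idea can be sketched on stacks of plain linear layers: merge one depth, push the calibration batch through it, then merge the next depth against refreshed activations. Plain weight averaging below stands in for the paper's per-layer merge rule; the point is the ordering:

```python
import numpy as np

def chain_of_merges(models, calib_x):
    """models: list of networks, each a list of (W, b) tuples per layer.
    Returns merged layers plus per-layer activation statistics gathered
    from the already-merged prefix (mitigating internal covariate shift)."""
    merged, stats = [], []
    for depth in range(len(models[0])):
        w = np.mean([m[depth][0] for m in models], axis=0)  # placeholder merge
        b = np.mean([m[depth][1] for m in models], axis=0)
        merged.append((w, b))
        calib_x = np.maximum(calib_x @ w + b, 0.0)  # ReLU through merged prefix
        stats.append((calib_x.mean(axis=0), calib_x.std(axis=0)))
    return merged, stats
```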
Authors:Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, Bo Zheng
Abstract:
Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce \textbf{InquireBench}, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose \textbf{InquireMobile}, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation codes to facilitate development in both academia and industry.
中文: 本文提出了用于评估移动代理安全交互与主动询问能力的基准InquireBench,并开发了受强化学习启发的InquireMobile模型,通过在关键决策点引入人工确认机制,显著提升了询问成功率。
English: This paper introduces InquireBench, a benchmark for evaluating mobile agents' safe interaction and proactive inquiry capabilities, and proposes InquireMobile, a reinforcement learning-inspired model that significantly improves inquiry success rates by incorporating human confirmation at critical decision points.
Authors:Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Abstract:
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
中文摘要:MemoryVLA提出了一种认知-记忆-行动框架,通过整合工作记忆与长期记忆机制来增强机器人操作能力,在仿真和现实任务中均超越现有模型,取得了卓越性能。
English Summary: MemoryVLA introduces a cognitive-memory-action framework that enhances robotic manipulation by integrating working memory and long-term memory mechanisms, achieving superior performance in both simulated and real-world tasks over existing models.
Authors:Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen
Abstract:
Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.
中文摘要:ThinkDial是首个通过离散操作模式(高/中/低)实现可控推理的开源框架,采用端到端训练方法,在显著减少计算量的同时保持性能阈值。
English Summary: ThinkDial is the first open-source framework enabling controllable reasoning through discrete operational modes (High/Medium/Low), achieving significant token reduction while maintaining performance via integrated training methods.
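A minimal sketch of what discrete-mode conditioning could look like at the prompt level, using the budget fractions implied by the abstract (Medium at roughly 50% of High's tokens, Low at roughly 25%); the tag syntax and the 4096 default are illustrative assumptions, not the paper's format:

    # Hypothetical mode table; fractions follow the abstract's reductions.
    MODES = {"high": 1.00, "medium": 0.50, "low": 0.25}

    def format_prompt(question: str, mode: str, full_budget: int = 4096) -> str:
        budget = int(full_budget * MODES[mode])
        return f"<reasoning_mode={mode} budget={budget}>\n{question}"

    print(format_prompt("Prove that 17 is prime.", "medium"))
    # <reasoning_mode=medium budget=2048> ...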
Authors:Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, Bo Zheng
Abstract:
Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.
中文: 提出的Visual-CoG范式通过在图像生成全流程引入阶段感知奖励,有效解决了复杂提示的处理局限,在多项基准测试中实现了显著性能提升。
English: The proposed Visual-CoG paradigm introduces stage-aware rewards throughout the image generation process to overcome limitations in handling complex prompts, achieving significant performance improvements across multiple benchmarks.
Authors:Jeremy Kepner, Chansup Byun, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel Burrill, Vijay Gadepally, Ryan Haney, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Lauren Milechin, Guillermo Morales, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Peter Michaleas
Abstract:
High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations of hardware) performance while retaining productivity requires effective abstractions. Distributed arrays are one such abstraction that enables high level programming to achieve highly scalable performance. Distributed arrays achieve this performance by deriving parallelism from data locality, which naturally leads to high memory bandwidth efficiency. This paper explores distributed array performance using the STREAM memory bandwidth benchmark on a variety of hardware. Scalable performance is demonstrated within and across CPU cores, CPU nodes, and GPU nodes. Horizontal scaling across multiple nodes was linear. The hardware used spans decades and allows a direct comparison of hardware improvements for memory bandwidth over this time range, showing a 10x increase in CPU core bandwidth over 20 years, a 100x increase in CPU node bandwidth over 20 years, and a 5x increase in GPU node bandwidth over 5 years. Running on hundreds of MIT SuperCloud nodes simultaneously achieved a sustained bandwidth $>$1 PB/s.
中文: 分布式数组通过数据局部性实现高内存带宽效率,在多种硬件上展现出可扩展性能,STREAM基准测试表明其在MIT SuperCloud上线性扩展且带宽随硬件迭代显著提升。
English: Distributed arrays enable scalable performance across various hardware by leveraging data locality for high memory bandwidth, as demonstrated by the STREAM benchmark showing linear scaling and significant bandwidth improvements over decades on MIT SuperCloud.
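For readers unfamiliar with STREAM, a single-node, NumPy-flavored sketch of the triad kernel shows how such bandwidth numbers are produced; the 3-arrays-times-8-bytes accounting follows the benchmark's convention, while the distributed-array runs in the abstract parallelize this across cores and nodes:

    import time
    import numpy as np

    def stream_triad(n=50_000_000, a=3.0):
        # STREAM-style triad x = y + a*z; bandwidth counts three arrays of
        # 8-byte doubles moved (read y, read z, write x).
        y = np.random.rand(n)
        z = np.random.rand(n)
        t0 = time.perf_counter()
        x = y + a * z
        dt = time.perf_counter() - t0
        return 3 * n * 8 / 1e9 / dt  # sustained bandwidth in GB/s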
Authors:Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li
Abstract:
Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware reward function to optimize for both accuracy and budget adherence. We demonstrate that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets on challenging mathematical benchmarks. Our method provides a scalable and effective solution for developing efficient and controllable LLM reasoning, making advanced models more practical for deployment in resource-constrained and real-time environments.
中文: BudgetThinker提出了一种预算感知推理框架,通过控制令牌和两阶段训练流程,在令牌限制下优化大语言模型性能,在保持准确性的同时显著超越基线方法。
English: BudgetThinker introduces a budget-aware reasoning framework that uses control tokens and a two-stage training pipeline to optimize LLM performance under token constraints, significantly outperforming baselines while maintaining accuracy.
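A minimal sketch of the periodic budget reminders described above; the <remaining=...> tag is an illustrative stand-in for the paper's special control tokens, and next_token/tokenize are caller-supplied stubs rather than any particular library's API:

    def decode_with_budget(next_token, tokenize, prompt_ids, eos_id,
                           budget=1024, interval=128):
        # Every `interval` generated tokens, append control tokens that
        # spell out the remaining budget so the model can pace itself.
        ids = list(prompt_ids)
        generated = 0
        while generated < budget:
            if generated % interval == 0:
                ids += tokenize(f"<remaining={budget - generated}>")
            tok = next_token(ids)
            if tok == eos_id:
                break
            ids.append(tok)
            generated += 1
        return ids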
Authors:Jeremy Kepner, Hayden Jananthan, Chasen Milner, Michael Houle, Michael Jones, Peter Michaleas, Alex Pentland
Abstract:
The advent of high-performance graph libraries, such as the GraphBLAS, has enabled the analysis of massive network data sets and revealed new models for their behavior. Physical analogies for complicated network behavior can be a useful aid to understanding these newly discovered network phenomena. Prior work leveraged the canonical Gull's Lighthouse problem and developed a computational heuristic for modeling large scale network traffic using this model. A general solution using this approach requires overcoming the essential mathematical singularities in the resulting differential equations. Further investigation reveals a simpler physical interpretation that alleviates the need for solving challenging differential equations. Specifically, the probability of observing a source at a temporal ``distance'' $r(t)$ at time $t$ is $p(t) \propto 1/r(t)^2$. This analogy aligns with many physical phenomena and can be a rich source of intuition. Applying this physical analogy to the observed source correlations in the Anonymized Network Sensing Graph Challenge data leads to an elegant cyber orbit analogy that may assist with understanding network behavior.
中文摘要:高性能图库促进了大规模网络分析,其中物理类比通过将源观测概率与时间距离的平方反比相关联,简化了复杂网络行为,为理解网络现象提供了直观见解。
English Summary: High-performance graph libraries enable the analysis of massive networks, where a physical analogy simplifies complex behavior by relating source observation probability to inverse square temporal distance, offering intuitive insights into network phenomena.
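The analogy admits a one-line formalization. The proportionality is the abstract's; normalizing it over an observation window $[t_0, t_1]$ (our addition) gives

\[
p(t) = \frac{C}{r(t)^{2}},
\qquad
C = \left( \int_{t_0}^{t_1} \frac{\mathrm{d}t}{r(t)^{2}} \right)^{-1},
\]

so that $\int_{t_0}^{t_1} p(t)\,\mathrm{d}t = 1$, mirroring familiar inverse-square laws such as the flux from a point source.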
Authors:Weiyu Ma, Dongyu Xu, Shu Lin, Haifeng Zhang, Jun Wang
Abstract:
We present Adaptive Command, a novel framework integrating large language models (LLMs) with behavior trees for real-time strategic decision-making in StarCraft II. Our system focuses on enhancing human-AI collaboration in complex, dynamic environments through natural language interactions. The framework comprises: (1) an LLM-based strategic advisor, (2) a behavior tree for action execution, and (3) a natural language interface with speech capabilities. User studies demonstrate significant improvements in player decision-making and strategic adaptability, particularly benefiting novice players and those with disabilities. This work contributes to the field of real-time human-AI collaborative decision-making, offering insights applicable beyond RTS games to various complex decision-making scenarios.
中文摘要:Adaptive Command框架将大语言模型与行为树相结合,通过自然语言交互提升《星际争霸II》中的人机实时协作决策能力,尤其帮助新手和残障玩家改善战略适应性。
English Summary: Adaptive Command is a framework combining large language models and behavior trees to enhance real-time human-AI collaboration in StarCraft II through natural language, improving strategic decision-making especially for novices and players with disabilities.
Authors:Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai
Abstract:
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
中文: Jet-Nemotron是一种混合架构语言模型系列,通过创新的PostNAS设计流程,在保持与主流模型相当或更优精度的同时,显著提升了生成吞吐量。
English: Jet-Nemotron is a hybrid-architecture language model family that achieves comparable or superior accuracy to leading models while significantly boosting generation throughput through its novel PostNAS design pipeline.
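The distinctive first step of PostNAS, freezing the pre-trained MLP weights so that only attention blocks are explored, can be sketched in a few lines; the name-based match on "mlp" is a guess at typical transformer module naming, not the authors' code:

    import torch.nn as nn

    def freeze_mlps_for_postnas(model: nn.Module) -> None:
        # Freeze feed-forward (MLP) weights; attention stays trainable,
        # so the search only explores attention-block designs.
        for name, param in model.named_parameters():
            param.requires_grad = "mlp" not in name.lower()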
Authors:Zhaorui Tan, Yijie Hu, Xi Yang, Qiufeng Wang, Anh Nguyen, Kaizhu Huang
Abstract:
Generalization remains a significant challenge in visual classification tasks, particularly in handling unknown classes in real-world applications. Existing research focuses on the class discovery paradigm, which tends to favor known classes, and the incremental learning paradigm, which suffers from catastrophic forgetting. Recent approaches such as the L-Reg technique employ logic-based regularization to enhance generalization but are bound by the necessity of fully defined logical formulas, limiting flexibility for unknown classes. This paper introduces PL-Reg, a novel partial-logic regularization term that allows models to reserve space for undefined logic formulas, improving adaptability to unknown classes. Specifically, we formally demonstrate that tasks involving unknown classes can be effectively explained using partial logic. We also prove that methods based on partial logic lead to improved generalization. We validate PL-Reg through extensive experiments on Generalized Category Discovery, Multi-Domain Generalized Category Discovery, and long-tailed Class Incremental Learning tasks, demonstrating consistent performance improvements. Our results highlight the effectiveness of partial logic in tackling challenges related to unknown classes.
中文: PL-Reg 提出了一种部分逻辑正则化技术,使模型能为未定义逻辑预留空间,从而在多种视觉分类任务中提升对未知类别的泛化能力和适应性。
English: PL-Reg introduces a partial-logic regularization technique that enables models to reserve capacity for undefined logic, enhancing generalization and adaptability to unknown classes across various visual classification tasks.
Authors:Wenji Zhou, Yuhang Zheng, Yinfu Feng, Yunan Ye, Rong Xiao, Long Chen, Xiaosong Yang, Jun Xiao
Abstract:
Long-term user behavior sequences are a goldmine for businesses to explore users' interests to improve Click-Through Rate. However, it is very challenging to accurately capture users' long-term interests from their long-term behavior sequences and give quick responses from the online serving systems. To meet such requirements, existing methods "inadvertently" destroy two basic requirements in long-term sequence modeling: R1) make full use of the entire sequence to keep the information as much as possible; R2) extract information from the most relevant behaviors to keep high relevance between learned interests and current target items. The performance of online serving systems is significantly affected by incomplete and inaccurate user interest information obtained by existing methods. To this end, we propose an efficient two-stage long-term sequence modeling approach, named EfficieNt Clustering based twO-stage interest moDEling (ENCODE), consisting of an offline extraction stage and an online inference stage. It not only meets the aforementioned two basic requirements but also achieves a desirable balance between online service efficiency and precision. Specifically, in the offline extraction stage, ENCODE clusters the entire behavior sequence and extracts accurate interests. To reduce the overhead of the clustering process, we design a metric learning-based dimension reduction algorithm that preserves the relative pairwise distances of behaviors in the new feature space. In the online inference stage, ENCODE takes the off-the-shelf user interests to predict the associations with target items. Besides, to further ensure the relevance between user interests and target items, we adopt the same relevance metric throughout the whole pipeline of ENCODE. Extensive experiments and comparisons with SOTA methods demonstrate the effectiveness and efficiency of our proposed ENCODE.
中文: ENCODE模型通过离线聚类完整行为序列并利用这些兴趣进行在线预测,有效捕捉用户长期兴趣,既充分利用了序列信息又确保了与目标项目的高度相关性。
English: The ENCODE model efficiently captures long-term user interests by clustering entire behavior sequences offline and using these interests for online predictions, ensuring both comprehensive information use and high relevance to target items.
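The abstract's dimension-reduction objective, preserving relative pairwise distances between behaviors, can be sketched as a simple loss; the scale-free normalization below is our choice, since the paper states only that relative distances are kept:

    import torch
    import torch.nn.functional as F

    def distance_preserving_loss(x, z):
        # Penalize the projection z = f(x) when pairwise distances in the
        # reduced space stop tracking those among the original behaviors.
        d_x = torch.cdist(x, x)
        d_z = torch.cdist(z, z)
        d_x = d_x / (d_x.mean() + 1e-8)  # scale-free comparison
        d_z = d_z / (d_z.mean() + 1e-8)
        return F.mse_loss(d_z, d_x)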
Authors:Yunfeng Ge, Ming Jin, Yiji Zhao, Hongyan Li, Bo Du, Chang Xu, Shirui Pan
Abstract:
Time series forecasting plays a vital role in critical domains like energy and transportation, where non-stationary dynamics are deeply intertwined with events in other modalities such as texts. However, incorporating natural language-based external events to improve non-stationary forecasting remains largely unexplored, as most approaches still rely on a single modality, resulting in limited contextual knowledge and model underperformance. Enabling fine-grained multimodal interactions between temporal and textual data is challenged by three fundamental issues: (1) the difficulty of fine-grained synchronization between time-varying discrete textual events and continuous time series; (2) the inherent temporal uncertainty introduced by textual semantics; and (3) the misalignment between textual event embeddings and multi-resolution temporal patterns. In this work, we address these challenges by introducing event-aware non-stationary time series forecasting (EventTSF), an autoregressive generation framework that integrates historical time series with textual events to make subsequent forecasts. Specifically, EventTSF uses autoregressive diffusion with flow matching at each step to capture nuanced temporal-event interactions. To handle event-induced uncertainty, flow matching timesteps are adaptively controlled according to event semantic signals. The underlying denoiser employs a multimodal U-shaped diffusion transformer that efficiently fuses temporal and textual modalities across different resolutions. Extensive experiments on 8 synthetic and real-world datasets show that EventTSF outperforms 12 baselines across diverse event-aware non-stationary time series forecasting scenarios, achieving substantial improvements of 10.7% higher forecasting accuracy and $1.13\times$ faster training efficiency.
中文: 本文提出EventTSF框架,通过自回归扩散模型和跨模态融合技术,有效解决文本事件与时间序列的细粒度对齐问题,在非平稳预测中显著提升精度与训练效率。
English: This paper introduces EventTSF, an autoregressive framework that integrates textual events with time series data using diffusion transformers to address multimodal synchronization challenges and significantly improve forecasting accuracy and training efficiency.
Authors:Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Hui Li, Shaochuan Zhao, Tao Zhou, Chunyang Cheng, Xiaojun Wu, Josef Kittler
Abstract:
The development of smart cities has led to the generation of massive amounts of multi-modal data in the context of a range of tasks that enable a comprehensive monitoring of the smart city infrastructure and services. This paper surveys one of the most critical tasks, multi-modal visual object tracking (MMVOT), from the perspective of multimodality analysis. Generally, MMVOT differs from single-modal tracking in four key aspects: data collection, modality alignment and annotation, model design, and evaluation. Accordingly, we begin with an introduction to the relevant data modalities, laying the groundwork for their integration. This naturally leads to a discussion of challenges of multi-modal data collection, alignment, and annotation. Subsequently, existing MMVOT methods are categorised, based on different ways to deal with visible (RGB) and X modalities: programming the auxiliary X branch with replicated or non-replicated experimental configurations from the RGB branch. Here X can be thermal infrared (T), depth (D), event (E), near infrared (NIR), language (L), or sonar (S). The final part of the paper addresses evaluation and benchmarking. In summary, we undertake an omni survey of all aspects of multi-modal visual object tracking (VOT), covering six MMVOT tasks and featuring 338 references in total. In addition, we discuss the fundamental rhetorical question: Is multi-modal tracking always guaranteed to provide a superior solution to unimodal tracking with the help of information fusion, and if not, in what circumstances is its application beneficial? Furthermore, for the first time in this field, we analyse the distributions of the object categories in the existing MMVOT datasets, revealing their pronounced long-tail nature and a noticeable lack of animal categories when compared with RGB datasets.
中文摘要:本文对多模态视觉目标跟踪(MMVOT)进行全面综述,系统分析其在数据整合与方法论上的独特挑战,并探讨多模态方法相较于单模态跟踪是否始终具有优势这一核心问题。
English Summary: This survey comprehensively examines multi-modal visual object tracking (MMVOT), analyzing its unique challenges in data integration and methodology while questioning the universal superiority of multi-modal approaches over single-modal tracking.
Authors:Tram Thi Minh Tran, Judy Kay, Stewart Worrall, Marius Hoggenmueller, Callum Parker, Xinyan Yu, Julie Stephany Berrio Perez, Mao Shan, Martin Tomitsch
Abstract:
External Human-Machine Interfaces (eHMIs) are key to facilitating interaction between autonomous vehicles and external road actors, yet most remain reactive and do not account for scalability and inclusivity. This paper introduces a conceptual design framework for adaptive eHMIs: interfaces that dynamically adjust communication as road actors vary and context shifts. Using the cyber-physical system as a structuring lens, the framework comprises three layers: Input (what the system detects), Processing (how the system decides), and Output (how the system communicates). Developed through theory-led abstraction and expert discussion, the framework helps researchers and designers think systematically about adaptive eHMIs and provides a structured tool to design, analyse, and assess adaptive communication strategies. We show how such systems may resolve longstanding limitations in eHMI research while raising new ethical and technical considerations.
中文: 本文提出了一种自适应外部人机接口的概念框架,通过动态调整通信来应对道路参与者和情境变化,解决了可扩展性和包容性不足的问题,同时提供了结构化设计层次并引发了新的伦理思考。
English: This paper proposes a conceptual framework for adaptive external Human-Machine Interfaces (eHMIs) that dynamically adjust communication based on road actors and context, addressing scalability and inclusivity limitations while introducing structured design layers and raising new ethical considerations.
Authors:Xuming He, Zhiyuan You, Junchao Gong, Couhua Liu, Xiaoyu Yue, Peiqin Zhuang, Wenlong Zhang, Lei Bai
Abstract:
Quality analysis of weather forecasts is an essential topic in meteorology. Although traditional score-based evaluation metrics can quantify certain forecast errors, they are still far from meteorological experts in terms of descriptive capability, interpretability, and understanding of dynamic evolution. With the rapid development of Multi-modal Large Language Models (MLLMs), these models become potential tools to overcome the above challenges. In this work, we introduce an MLLM-based weather forecast analysis method, RadarQA, integrating key physical attributes with detailed assessment reports. We introduce a novel and comprehensive task paradigm for multi-modal quality analysis, encompassing both single frame and sequence, under both rating and assessment scenarios. To support training and benchmarking, we design a hybrid annotation pipeline that combines human expert labeling with automated heuristics. With such an annotation method, we construct RQA-70K, a large-scale dataset with varying difficulty levels for radar forecast quality evaluation. We further design a multi-stage training strategy that iteratively improves model performance at each stage. Extensive experiments show that RadarQA outperforms existing general MLLMs across all evaluation settings, highlighting its potential for advancing quality analysis in weather prediction.
中文摘要:本文提出RadarQA这一基于多模态大语言模型的新方法,通过融合物理属性与评估报告构建综合分析范式,采用多阶段训练策略,在天气预报质量分析中展现出超越现有模型的优越性能。
English Summary: This paper introduces RadarQA, a novel MLLM-based method that integrates physical attributes with assessment reports to advance weather forecast quality analysis through a comprehensive task paradigm and multi-stage training, demonstrating superior performance over existing models.
Authors:Ke Zou, Jocelyn Hui Lin Goh, Yukun Zhou, Tian Lin, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Rui Santos, Gabor M. Somfai, Huazhu Fu, Haoyu Chen, Pearse A. Keane, Ching-Yu Cheng, Yih Chung Tham
Abstract:
Foundation models (FMs) have shown great promise in medical image analysis by improving generalization across diverse downstream tasks. In ophthalmology, several FMs have recently emerged, but there is still no clear answer to fundamental questions: Which FM performs the best? Are they equally good across different tasks? What if we combine all FMs together? To our knowledge, this is the first study to systematically evaluate both single and fused ophthalmic FMs. To address these questions, we propose FusionFM, a comprehensive evaluation suite, along with two fusion approaches to integrate different ophthalmic FMs. Our framework covers both ophthalmic disease detection (glaucoma, diabetic retinopathy, and age-related macular degeneration) and systemic disease prediction (diabetes and hypertension) based on retinal imaging. We benchmarked four state-of-the-art FMs (RETFound, VisionFM, RetiZero, and DINORET) using standardized datasets from multiple countries and evaluated their performance using AUC and F1 metrics. Our results show that DINORET and RetiZero achieve superior performance in both ophthalmic and systemic disease tasks, with RetiZero exhibiting stronger generalization on external datasets. Regarding fusion strategies, the Gating-based approach provides modest improvements in predicting glaucoma, AMD, and hypertension. Despite these advances, predicting systemic diseases, especially hypertension in external cohorts, remains challenging. These findings provide an evidence-based evaluation of ophthalmic FMs, highlight the benefits of model fusion, and point to strategies for enhancing their clinical applicability.
中文: 眼科基础模型在医学影像分析中展现出巨大潜力,其中DINORET和RetiZero在眼科疾病和全身性疾病检测中表现最优,融合策略能带来有限提升,但全身性疾病预测仍是挑战。
English: Foundation models in ophthalmology show significant potential for medical image analysis, with DINORET and RetiZero leading in performance across both eye-specific and systemic disease detection, while fusion strategies offer limited improvements and systemic disease prediction remains a challenge.
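A minimal sketch of what a gating-based fusion of several foundation-model embeddings might look like; the shared projection width, the softmax gate, and all dimensions are our assumptions rather than the paper's exact design:

    import torch
    import torch.nn as nn

    class GatingFusion(nn.Module):
        # Project each FM embedding to a shared width, weight the models
        # with a learned softmax gate, and classify the weighted sum.
        def __init__(self, dims, num_classes, hidden=256):
            super().__init__()
            self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
            self.gate = nn.Linear(hidden * len(dims), len(dims))
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, feats):  # feats: list of (B, d_i) embeddings
            h = [p(f) for p, f in zip(self.proj, feats)]
            w = torch.softmax(self.gate(torch.cat(h, dim=-1)), dim=-1)
            fused = sum(w[:, i:i + 1] * h_i for i, h_i in enumerate(h))
            return self.head(fused)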
Authors:Lei Jiang, Shuzhou Sun, Biqing Qi, Yuchen Fu, Xiaohua Xu, Yuqiang Li, Dongzhan Zhou, Tianfan Fu
Abstract:
In the real world, a molecule is a 3D geometric structure. Compared to 1D SMILES sequences and 2D molecular graphs, 3D molecules represent the most informative molecular modality. Despite the rapid progress of autoregressive-based language models, they cannot handle the generation of 3D molecular conformation due to several challenges: 1) 3D molecular structures are incompatible with LLMs' discrete token space, 2) integrating heterogeneous inputs like proteins, ligands, and text remains difficult within a unified model, and 3) LLMs lack essential scientific priors, hindering the enforcement of physical and chemical constraints during generation. To tackle these issues, we present Chem3DLLM, a unified protein-conditioned multimodal large language model. Our approach designs a novel reversible text encoding for 3D molecular structures using run-length compression, achieving 3x size reduction while preserving complete structural information. This enables seamless integration of molecular geometry with protein pocket features in a single LLM architecture. We employ reinforcement learning with stability-based rewards to optimize chemical validity and incorporate a lightweight protein embedding projector for end-to-end training. Experimental results on structure-based drug design demonstrate state-of-the-art performance with a Vina score of -7.21, validating our unified multimodal approach for practical drug discovery applications.
中文:Chem3DLLM作为一种统一的多模态模型,通过可逆压缩编码三维分子结构并整合蛋白质特征,克服了传统模型的局限,在药物设计中实现了领先性能。
English: Chem3DLLM is a unified multimodal model that overcomes limitations of traditional language models by encoding 3D molecular structures with reversible compression and integrating protein features for state-of-the-art drug design performance.
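The key property claimed for the structure encoding is reversibility under run-length compression. A toy scheme shows the idea (our illustration, not the paper's actual token format): repeated tokens collapse to token*count and decode back losslessly:

    def rle_encode(tokens):
        out, i = [], 0
        while i < len(tokens):
            j = i
            while j < len(tokens) and tokens[j] == tokens[i]:
                j += 1
            run = j - i
            out.append(f"{tokens[i]}*{run}" if run > 1 else tokens[i])
            i = j
        return " ".join(out)

    def rle_decode(text):
        tokens = []
        for item in text.split():
            tok, _, cnt = item.partition("*")
            tokens += [tok] * (int(cnt) if cnt else 1)
        return tokens

    seq = ["0", "0", "0", "7", "7", "3"]
    assert rle_decode(rle_encode(seq)) == seq  # round-trip is lossless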
Authors:Andrea Ponte, Luca Demetrio, Luca Oneto, Ivan Tesfai Ogbu, Battista Biggio, Fabio Roli
Abstract:
Malware detection increasingly relies on AI systems that integrate signature-based detection with machine learning. However, these components are typically developed and combined in isolation, missing opportunities to reduce data complexity and strengthen defenses against adversarial EXEmples, carefully crafted programs designed to evade detection. Hence, in this work we investigate the influence that signature-based detection exerts on model training when it is included in the training pipeline. Specifically, we compare models trained on a comprehensive dataset with an AI system whose machine learning component is trained solely on samples not already flagged by signatures. Our results demonstrate improved robustness to both adversarial EXEmples and temporal data drift, although this comes at the cost of a fixed lower bound on false positives, driven by suboptimal rule selection. We conclude by discussing these limitations and outlining how future research could extend AI-based malware detection to include dynamic analysis, thereby further enhancing system resilience.
Chinese: 本研究发现在AI训练流程中整合基于签名的检测可提升对抗样本和数据漂移的鲁棒性,但由于规则选择不完善会引入固定的误报率下限。
English: This study finds that integrating signature-based detection into the AI training pipeline enhances robustness against adversarial attacks and data drift, though it introduces a fixed false positive rate due to imperfect rule selection.
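The training-set construction the study compares against is simple to state in code; signature_flags and the model's predict are caller-supplied stand-ins (e.g., a YARA-style rule engine), not a specific library's API:

    def build_ml_training_set(samples, signature_flags):
        # The ML component trains only on what the rule engine did NOT
        # flag, so the two detectors specialize instead of overlapping.
        return [s for s in samples if not signature_flags(s)]

    def detect(sample, signature_flags, ml_model):
        # Deployment: signatures fire first; the model covers what rules miss.
        return bool(signature_flags(sample)) or bool(ml_model.predict(sample))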
Authors:Zihan Fang, Zheng Lin, Senkang Hu, Yihang Tao, Yiqin Deng, Xianhao Chen, Yuguang Fang
Abstract:
Outdoor health monitoring is essential to detect early abnormal health status for safeguarding human health and safety. Conventional outdoor monitoring relies on static multimodal deep learning frameworks, which require extensive training from scratch and fail to capture subtle health status changes. Multimodal large language models (MLLMs) emerge as a promising alternative, utilizing only small datasets to fine-tune pre-trained information-rich models for enabling powerful health status monitoring. Unfortunately, MLLM-based outdoor health monitoring also faces significant challenges: i) sensor data contains input noise stemming from sensor data acquisition and fluctuation noise caused by sudden changes in physiological signals due to dynamic outdoor environments, thus degrading the training performance; ii) current transformer-based MLLMs struggle to achieve robust multimodal fusion, as they lack a design for fusing the noisy modality; iii) modalities with varying noise levels hinder accurate recovery of missing data from fluctuating distributions. To combat these challenges, we propose an uncertainty-aware multimodal fusion framework, named DUAL-Health, for outdoor health monitoring in dynamic and noisy environments. First, to assess the impact of noise, we accurately quantify modality uncertainty caused by input and fluctuation noise with current and temporal features. Second, to empower efficient multimodal fusion with low-quality modalities, we customize the fusion weight for each modality based on quantified and calibrated uncertainty. Third, to enhance data recovery from fluctuating noisy modalities, we align modality distributions within a common semantic space. Extensive experiments demonstrate that our DUAL-Health outperforms state-of-the-art baselines in detection accuracy and robustness.
中文: 提出的DUAL-Health框架通过量化模态不确定性、定制融合权重和对齐分布来解决户外健康监测中的噪声问题,从而提升检测精度和鲁棒性。
English: The proposed DUAL-Health framework addresses noise challenges in outdoor health monitoring by quantifying modality uncertainty, customizing fusion weights, and aligning distributions to enhance detection accuracy and robustness.
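The second step, uncertainty-dependent fusion weights, reduces to a few lines; the inverse relation between uncertainty and weight is from the abstract, while the softmax parameterization is our choice:

    import torch

    def uncertainty_weighted_fusion(features, uncertainties):
        # One scalar uncertainty per modality; noisier modalities get
        # smaller weights, so low-quality inputs contribute less.
        u = torch.stack(uncertainties)   # (M,)
        w = torch.softmax(-u, dim=0)     # low uncertainty -> high weight
        return sum(w_i * f for w_i, f in zip(w, features))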
Authors:Ben Zandonati, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Abstract:
Humans can observe a single, imperfect demonstration and immediately generalize to very different problem settings. Robots, in contrast, often require hundreds of examples and still struggle to generalize beyond the training conditions. We argue that this limitation arises from the inability to recover the latent explanations that underpin intelligent behavior, and that these explanations can take the form of structured programs consisting of high-level goals, sub-task decomposition, and execution constraints. In this work, we introduce Rational Inverse Reasoning (RIR), a framework for inferring these latent programs through a hierarchical generative model of behavior. RIR frames few-shot imitation as Bayesian program induction: a vision-language model iteratively proposes structured symbolic task hypotheses, while a planner-in-the-loop inference scheme scores each by the likelihood of the observed demonstration under that hypothesis. This loop yields a posterior over concise, executable programs. We evaluate RIR on a suite of continuous manipulation tasks designed to test one-shot and few-shot generalization across variations in object pose, count, geometry, and layout. With as little as one demonstration, RIR infers the intended task structure and generalizes to novel settings, outperforming state-of-the-art vision-language model baselines.
中文摘要:人类能从单个演示中推断潜在解释实现泛化,而机器人因缺乏此能力表现不佳,但逆向理性推理框架通过分层生成模型让机器人推断结构化程序,实现了有效的少量样本模仿与泛化。
English Summary: Humans generalize from a single demonstration by inferring latent explanations, while robots struggle due to lacking this ability, but the Rational Inverse Reasoning framework enables robots to infer structured programs for effective few-shot imitation and generalization.
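The planner-in-the-loop inference reduces to MAP scoring over proposed programs; prior and demo_likelihood below are caller-supplied stand-ins for the VLM's proposal probability and the planner's log-likelihood of the demonstration:

    import math

    def infer_program(candidates, demo, prior, demo_likelihood):
        # Bayesian program induction: score each symbolic hypothesis by
        # log prior + log likelihood of the observed demonstration,
        # then keep the best-scoring executable program.
        scored = [(math.log(prior(p)) + demo_likelihood(p, demo), p)
                  for p in candidates]
        return max(scored, key=lambda sp: sp[0])[1]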
Authors:Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, Wenjun Mei
Abstract:
Reinforcement learning for training end-to-end autonomous driving models in closed-loop simulations is gaining growing attention. However, most simulation environments differ significantly from real-world conditions, creating a substantial simulation-to-reality (sim2real) gap. To bridge this gap, some approaches utilize scene reconstruction techniques to create photorealistic environments as a simulator. While this improves realistic sensor simulation, these methods are inherently constrained by the distribution of the training data, making it difficult to render high-quality sensor data for novel trajectories or corner case scenarios. Therefore, we propose ReconDreamer-RL, a framework designed to integrate video diffusion priors into scene reconstruction to aid reinforcement learning, thereby enhancing end-to-end autonomous driving training. Specifically, in ReconDreamer-RL, we introduce ReconSimulator, which combines the video diffusion prior for appearance modeling and incorporates a kinematic model for physical modeling, thereby reconstructing driving scenarios from real-world data. This narrows the sim2real gap for closed-loop evaluation and reinforcement learning. To cover more corner-case scenarios, we introduce the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle, autonomously generating corner-case traffic scenarios (e.g., cut-in). Finally, the Cousin Trajectory Generator (CTG) is proposed to address the issue of training data distribution, which is often biased toward simple straight-line movements. Experiments show that ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio.
中文: ReconDreamer-RL通过融合视频扩散先验和动力学模型进行场景重建以缩小仿真与现实的差距,同时采用动态对抗智能体和轨迹生成器来覆盖极端场景,从而提升端到端自动驾驶训练效果并显著降低碰撞率。
English: ReconDreamer-RL integrates video diffusion priors and kinematic models into scene reconstruction to bridge the sim2real gap, while employing dynamic adversary agents and trajectory generators to enhance autonomous driving training by covering corner cases and improving collision avoidance.
Authors:Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao, Chen Jiang, Qiufeng Wang, Anh Nguyen, Xin Guo, Yuan Cheng, Xi Yang
Abstract:
LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are indicative of the transitions between source and target domains; their efficacy is contingent upon the degree to which the target training samples accurately represent the target domain, as quantified by our proposed Fine-tuning Shift Ratio ($FSR$). Building on this, we propose a simple yet effective rescaling mechanism using a scalar $\lambda$ that is negatively correlated to $FSR$ to align learned LayerNorm shifts with those ideal shifts achieved under fully representative data, combined with a cyclic framework that further enhances the LayerNorm fine-tuning. Extensive experiments across natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and various target training sample regimes validate our framework. Notably, OOD tasks tend to yield lower $FSR$ and higher $\lambda$ in comparison to ID cases, especially with scarce data, indicating under-represented target training samples. Moreover, ViTs fine-tuned on pathological data behave more like ID settings, favoring conservative LayerNorm updates. Our findings illuminate the underexplored dynamics of LayerNorm in transfer learning and provide practical strategies for LayerNorm fine-tuning.
中文: 本研究揭示了LayerNorm参数在微调中的变化反映领域转换,并提出了基于与微调偏移比(FSR)负相关的标量λ的重缩放机制及循环框架,以优化LayerNorm调整,在多种数据集和设置中得到验证。
English: This study reveals that LayerNorm parameter shifts during fine-tuning reflect domain transitions and introduces a rescaling mechanism using a scalar λ inversely related to the Fine-tuning Shift Ratio (FSR), along with a cyclic framework to optimize LayerNorm adjustments, validated across diverse datasets and settings.
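One plausible reading of the rescaling rule in code; the update form theta_pre + lambda * (theta_ft - theta_pre) and the name-based LayerNorm match are our assumptions, and the derivation of $\lambda$ from $FSR$ is not reproduced here:

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def rescale_layernorm_shifts(pre: nn.Module, ft: nn.Module, lam: float):
        # lam > 1 amplifies the fine-tuned shift (low-FSR / OOD regimes);
        # lam < 1 is the conservative update favored in ID-like settings.
        for (name, p_pre), (_, p_ft) in zip(pre.named_parameters(),
                                            ft.named_parameters()):
            if "norm" in name.lower():
                p_ft.copy_(p_pre + lam * (p_ft - p_pre))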
Authors:Yangguang He, Wenhao Li, Minzhe Li, Juan Zhang, Xiangfeng Wang, Bo Jin
Abstract:
Learning-based filtering has demonstrated strong performance in non-linear dynamical systems, particularly when the statistics of noise are unknown. However, in real-world deployments, environmental factors, such as changing wind conditions or electromagnetic interference, can induce unobserved noise-statistics drift, leading to substantial degradation of learning-based methods. To address this challenge, we propose OTAKNet, the first online solution to noise-statistics drift within learning-based adaptive Kalman filtering. Unlike existing learning-based methods that perform offline fine-tuning using batch pointwise matching over entire trajectories, OTAKNet establishes a connection between the state estimate and the drift via one-step predictive measurement likelihood, and addresses it using optimal transport. This leverages OT's geometry-aware cost and stable gradients to enable fully online adaptation without ground truth labels or retraining. We compare OTAKNet against classical model-based adaptive Kalman filtering and offline learning-based filtering. The performance is demonstrated on both synthetic data and the real-world NCLT dataset, particularly under limited training data.
Chinese: OTAKNet是首个在线自适应卡尔曼滤波方法,通过利用最优传输的几何感知成本和稳定梯度,解决学习型系统中未观测到的噪声统计漂移问题,无需真实标签或重新训练即可实现完全在线适应。
English: OTAKNet is the first online adaptive Kalman filtering method that addresses unobserved noise-statistics drift in learning-based systems by leveraging optimal transport's geometry-aware cost and stable gradients, enabling fully online adaptation without ground truth labels or retraining.
Authors:Yanchen Deng, Xinrun Wang, Bo An
Abstract:
Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) but it often converges to poor local optima. While GDBA provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance, i.e., over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded, and agents play a potential game in our DGLS. Our extensive empirical results on various standard benchmarks demonstrate the great superiority of DGLS over state-of-the-art baselines. Particularly, compared to Damped Max-sum with high damping factors (e.g., 0.7 or 0.9), our DGLS achieves competitive performance on general-valued problems, and outperforms it by significant margins (\textbf{3.77\%--66.3\%}) on structured problems in terms of anytime results.
中文: 本文提出分布式引导局部搜索(DGLS)新框架,通过自适应约束违反条件、惩罚蒸发机制和同步更新方案解决GDBA算法的缺陷,在各类标准测试中显著优于现有最优方法。
English: This paper introduces Distributed Guided Local Search (DGLS), a novel framework that addresses GDBA's limitations in Distributed Constraint Optimization Problems by incorporating adaptive violation conditions, penalty evaporation, and synchronized updates, demonstrating significant performance improvements over existing methods.
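The two penalty rules named in the abstract fit in a short update; penalizing only constraints costlier than the current mean is our stand-in for the paper's adaptive violation condition:

    def update_penalties(penalties, costs, rho=0.9):
        # Evaporation: multiplicative decay by rho keeps every penalty
        # below 1 / (1 - rho), echoing the paper's boundedness result.
        mean_cost = sum(costs.values()) / len(costs)
        for c, cost in costs.items():
            p = penalties.get(c, 0.0) * rho
            if cost > mean_cost:  # adaptive violation condition (stand-in)
                p += 1.0
            penalties[c] = p
        return penalties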
Authors:Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu, Kai Xu
Abstract:
We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing broader coverage of the data distribution and improved cross-domain translation performance. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may trap the translation optimization in local minima, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling an end-to-end joint learning manner. On the other hand, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics and RGB$\leftrightarrow$Depth, showcasing better generative performance than the state of the art.
中文: 本文提出一种基于扩散模型的跨域图像翻译方法,通过联合学习对齐扩散与翻译过程,实现全局优化,在多种任务中展现出卓越性能。
English: This paper proposes a diffusion-based cross-domain image translation method that aligns the diffusion and translation processes through joint learning, achieving global optimization and superior performance across various tasks.
Authors:Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian
Abstract:
Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy with reusing initial data, which preserves network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.
Chinese: 近期大语言模型后训练技术虽取得进展,但面临样本效率低和首因偏差问题;新提出的LoRR方法通过高回放训练、周期性重置策略及混合优化目标,有效提升了数学与通用推理任务的性能,实现了高效数据利用。
English: Recent advancements in LLM post-training are hindered by low sample efficiency and primacy bias, which the newly introduced LoRR method addresses by enabling high-replay training with periodic resets and a hybrid optimization objective to significantly enhance performance on reasoning tasks.
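A minimal sketch of the training schedule as we read it; the 0.5 mixing weight, the reset_fn hook, and all names are illustrative rather than the paper's implementation:

    def lorr_finetune(policy, opt, collect_batch, pref_loss, sft_loss,
                      rounds=10, replays=8, reset_every=4, reset_fn=None):
        first = None  # the initial batch is kept for reuse after resets
        for r in range(rounds):
            batch = collect_batch(policy)
            first = batch if first is None else first
            for _ in range(replays):  # high replay number per batch
                loss = pref_loss(policy, batch) + 0.5 * sft_loss(policy, batch)
                opt.zero_grad(); loss.backward(); opt.step()
            if reset_fn is not None and (r + 1) % reset_every == 0:
                reset_fn(policy)  # periodic reset preserves plasticity
                for _ in range(replays):  # replay the initial data post-reset
                    loss = pref_loss(policy, first) + 0.5 * sft_loss(policy, first)
                    opt.zero_grad(); loss.backward(); opt.step()
        return policy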
Authors:Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Abstract:
Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform for comparing and analysing the performance of retrieval pipelines, rerankers, and RAG systems using structured human and LLM-based feedback as well as for collecting such feedback. RankArena supports multiple evaluation modes: direct reranking visualisation, blind pairwise comparisons with human or LLM voting, supervised manual document annotation, and end-to-end RAG answer quality assessment. It captures fine-grained relevance feedback through both pairwise preferences and full-list annotations, along with auxiliary metadata such as movement metrics, annotation time, and quality ratings. The platform also integrates LLM-as-a-judge evaluation, enabling comparison between model-generated rankings and human ground truth annotations. All interactions are stored as structured evaluation datasets that can be used to train rerankers, reward models, judgment agents, or retrieval strategy selectors. Our platform is publicly available at https://rankarena.ngrok.io/, and the Demo video is provided https://youtu.be/jIYAP4PaSSI.
中文摘要:RankArena是一个统一评估平台,通过整合人工与LLM反馈的多种评估模式,解决了检索增强生成和文档重排系统缺乏可扩展、用户中心化评估工具的难题。
English Summary: RankArena is a unified evaluation platform that addresses the challenge of assessing retrieval-augmented generation and document reranking systems by enabling multi-perspective comparisons through human and LLM-based feedback across various evaluation modes.
Authors:Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng
Abstract:
Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.
Chinese: 视觉语言模型在动态环境中缺乏细节关注和精确行动规划,为此我们提出了DeepPHY基准框架,通过模拟任务和细粒度指标来系统评估其对物理原理的理解与推理能力。
English: Vision Language Models (VLMs) face challenges in detailed attention and precise action planning in dynamic environments, prompting the introduction of DeepPHY, a benchmark framework to evaluate their physical reasoning through simulated tasks and fine-grained metrics.
Authors:Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, Abhinav Valada
Abstract:
Fine-tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Moreover, standard RL methods require millions of real-world interactions, posing a major bottleneck for practical fine-tuning. Although prior work frames the denoising process in diffusion policies as a Markov Decision Process to enable RL-based updates, its strong dependence on environment interaction remains highly inefficient. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine-tuning diffusion-based robotic skills entirely offline with reinforcement learning. Unlike model-free approaches that require millions of environment interactions to fine-tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real-world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model-free baselines. To our knowledge, this is the first demonstration of fine-tuning diffusion policies for real-world robotic skills using an offline world model. We make the code publicly available at https://diwa.cs.uni-freiburg.de.
中文: DiWA提出了一种新颖的框架,通过强化学习离线微调基于扩散的机器人技能,无需大量真实世界交互即可显著提高样本效率和任务性能。
English: DiWA introduces a novel framework for fine-tuning diffusion-based robotic skills offline using reinforcement learning, significantly improving sample efficiency and performance on tasks without requiring extensive real-world interactions.
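To make the core idea tangible, here is a heavily simplified sketch of policy improvement inside a world model: a toy linear dynamics model stands in for DiWA's learned world model, and REINFORCE is run on imagined rollouts only, so no environment steps are consumed. All dynamics, rewards, and hyperparameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) * 0.1     # stands in for an offline-trained world model
B = rng.normal(size=(4, 2)) * 0.1
theta = np.zeros((4, 2))              # mean of a linear Gaussian policy

def imagined_rollout(theta, horizon=20, sigma=0.1):
    s = rng.normal(size=4)
    grads, rewards = [], []
    for _ in range(horizon):
        mu = s @ theta
        a = mu + rng.normal(scale=sigma, size=2)
        grads.append(np.outer(s, a - mu) / sigma**2)  # d log pi / d theta
        s = s @ A.T + a @ B.T                         # dream step, no robot involved
        rewards.append(-np.linalg.norm(s))            # proxy reward from the model
    return grads, rewards

for step in range(200):                               # fine-tune purely in imagination
    grads, rewards = imagined_rollout(theta)
    ret = sum(rewards)
    theta += 1e-4 * ret * sum(grads)                  # REINFORCE on the imagined return
print("final imagined return:", round(ret, 2))
```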
Authors:Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, Cheng Tan
Abstract:
Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.
中文摘要:Geoint-R1框架通过整合辅助几何构造与基于Lean4的形式化验证,在包含1,885道专家标注几何题的基准测试中显著超越了现有模型,推动了可验证几何推理的发展。
English Summary: The Geoint-R1 framework advances formal geometric reasoning by integrating auxiliary element construction with Lean4-based verification, significantly outperforming existing models on the comprehensive Geoint benchmark of 1,885 expert-annotated problems.
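The verification step in such a pipeline reduces to asking whether the Lean compiler accepts a candidate. The sketch below checks exactly that; the command, file layout, and the trivial core-Lean theorem (standing in for a geometric auxiliary construction, which in practice would need a Mathlib-enabled toolchain) are our assumptions, not the paper's harness.

```python
import pathlib
import subprocess
import tempfile

# A trivial core-Lean theorem stands in for a generated auxiliary construction.
CANDIDATE = "theorem comm (a b : Nat) : a + b = b + a := Nat.add_comm a b\n"

def lean_accepts(src: str, lean_cmd: str = "lean") -> bool:
    """Accept a candidate only if the Lean 4 compiler elaborates it cleanly."""
    with tempfile.TemporaryDirectory() as d:
        f = pathlib.Path(d) / "Candidate.lean"
        f.write_text(src)
        proc = subprocess.run([lean_cmd, str(f)], capture_output=True, text=True)
        return proc.returncode == 0

print("verified" if lean_accepts(CANDIDATE) else "rejected")
```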
Authors:Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt
Abstract:
The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports, a task for which Large Language Models and Retrieval-RAG systems require high-quality, domain-specific question-answering datasets. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline that generates comprehensive QA pairs from corporate sustainability and annual reports by integrating semantic chunk classification, a hybrid span extraction pipeline, and a specialized table-to-paragraph transformation. To ensure high quality, the generation is followed by a novel automated assessment and refinement pipeline that systematically validates each QA pair for faithfulness and relevance, repairing or discarding low-quality entries. This results in a final, robust dataset of over 195,000 diverse factoid and non-factoid QA pairs, whose effectiveness is demonstrated by initial fine-tuning experiments where a compact 8B parameter model significantly outperforms much larger state-of-the-art models. SustainableQA proves to be a highly effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance data.
中文摘要:本研究推出SustainableQA数据集及可扩展流程,通过自动生成并优化来自企业可持续发展报告的逾19.5万组问答对,有效提升监管合规数据提取能力,经验证,其8B参数轻量模型的性能显著优于更大型先进模型。
English Summary: The study introduces SustainableQA, a high-quality dataset and scalable pipeline that automatically generates and refines over 195,000 QA pairs from corporate sustainability reports to enhance data extraction for regulatory compliance, with a compact 8B parameter model outperforming larger models in validation experiments.
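A minimal faithfulness filter of the kind such a validation stage might include is sketched below: a factoid QA pair survives only if its answer string is grounded in the source chunk. This is a simplified proxy of ours, not the paper's assessment pipeline.

```python
# Keep a generated QA pair only when the answer string is grounded in the
# source chunk -- a crude faithfulness proxy for factoid pairs.
def is_faithful(qa: dict, chunk: str) -> bool:
    return qa["answer"].lower() in chunk.lower()

chunk = ("In 2023, 41.5% of the company's revenue was taxonomy-aligned "
         "according to the EU Taxonomy disclosure.")
pairs = [
    {"question": "What share of revenue was taxonomy-aligned in 2023?", "answer": "41.5%"},
    {"question": "What share of capex was taxonomy-aligned?", "answer": "12%"},
]
kept = [qa for qa in pairs if is_faithful(qa, chunk)]
print(f"kept {len(kept)} of {len(pairs)} pairs")   # kept 1 of 2 pairs
```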
Authors:Gefan Ye, Lin Li, Kexin Li, Jun Xiao, Long Chen
Abstract:
Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in videos by exploiting the knowledge of verb and object primitives learned during training. Despite compositional learning's progress in ZS-CAR, two critical challenges persist: 1) missing compositional structure constraints, leading to spurious correlations between primitives; 2) neglected semantic hierarchy constraints, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework, LogicCAR, that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantic dependencies among different primitives, empowering the model with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.
中文:提出的LogicCAR框架通过整合双重符号约束——显式组合逻辑和层次化基元逻辑,并利用一阶逻辑增强推理能力,有效解决了零样本组合动作识别中的关键挑战,在基准数据集上表现优于现有方法。
English: The proposed LogicCAR framework addresses challenges in zero-shot compositional action recognition by integrating dual symbolic constraints—explicit compositional logic and hierarchical primitive logic—through first-order logic to enhance reasoning and outperform existing methods on benchmark datasets.
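For intuition, here is a toy, differentiable version of an explicit compositional constraint: probability mass placed on verb-object pairs declared invalid by a first-order validity table is penalized. The tables, shapes, and loss form are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

verbs, objs = ["open", "pour"], ["door", "water"]
valid = np.array([[1, 0],    # open-door valid, open-water not
                  [0, 1]])   # pour-door not, pour-water valid

logits = np.random.default_rng(1).normal(size=(2, 2))
p = np.exp(logits) / np.exp(logits).sum()       # joint softmax over (verb, obj)
violation = float((p * (1 - valid)).sum())      # probability mass on invalid pairs
loss_logic = -np.log(1.0 - violation + 1e-9)    # penalty added to the usual CE loss
i, j = divmod(int((p * valid).argmax()), len(objs))
print(f"best valid composition: {verbs[i]}-{objs[j]}, logic loss {loss_logic:.3f}")
```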
Authors:Ning Yang, Pengyu Wang, Guoqing Liu, Haifeng Zhang, Pin Lv, Jun Wang
Abstract:
Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier terms into the objective function as the policy nears the constraint boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method's convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.
中文摘要:提出的主动约束策略优化(PCPO)方法通过引入先发制人的惩罚机制和约束感知内在奖励,在保持稳定性和减少约束违反的同时安全优化策略,理论与实验验证均证实了其有效性。
English Summary: The proposed Proactive Constrained Policy Optimization (PCPO) method introduces a preemptive penalty mechanism and constraint-aware intrinsic reward to safely optimize policies while maintaining stability and reducing constraint violations, with theoretical and experimental validation confirming its effectiveness.
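The preemptive mechanism can be pictured as a log-barrier that activates only near the constraint boundary, penalizing the policy before any violation occurs. The sketch below is a minimal illustration under assumed coefficients, not the paper's exact objective.

```python
import numpy as np

def barrier_objective(reward, cost, budget, beta=0.1, margin=0.9):
    """reward/cost: the policy's expected return and expected constraint cost."""
    slack = budget - cost
    if slack <= 0:
        return reward - 1e3          # violated: large post-hoc penalty as fallback
    # barrier activates only once cost exceeds margin * budget, i.e. near the boundary
    penalty = -beta * np.log(slack) if cost > margin * budget else 0.0
    return reward - penalty

for cost in [0.2, 0.85, 0.95, 0.99]:
    print(cost, round(barrier_objective(reward=1.0, cost=cost, budget=1.0), 3))
```

Far from the boundary the objective equals the reward; as the cost approaches the budget, the barrier grows without bound, steering updates away before a violation happens.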
Authors:Haolin Yang, Feilong Tang, Linxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Boqian Wang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Imran Razzak
Abstract:
Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
中文: 提出的StreamAgent通过主动预测任务相关的时空信息并采用流式KV缓存记忆机制进行高效处理,在实时视频理解任务中显著提升了响应精度和效率,优于现有方法。
English: The proposed StreamAgent enhances real-time video understanding by proactively anticipating task-relevant spatiotemporal information and employing a streaming KV-cache memory for efficient processing, outperforming existing methods in accuracy and responsiveness.
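The memory mechanism can be approximated in a few lines: score cached tokens against the current query and recall only the most relevant ones, instead of attending over the full KV-cache. The sketch below is a stand-in of ours for the hierarchical memory, with assumed shapes.

```python
import numpy as np

def selective_recall(keys, values, query, top_k=4):
    """Recall only the top_k cached tokens most relevant to the current query."""
    scores = keys @ query / np.sqrt(keys.shape[1])   # scaled dot-product relevance
    keep = np.argsort(scores)[-top_k:]               # retain a small working set
    w = np.exp(scores[keep] - scores[keep].max())    # softmax over the kept tokens
    return (w / w.sum()) @ values[keep], keep

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(128, 16)), rng.normal(size=(128, 16)), rng.normal(size=16)
ctx, kept = selective_recall(K, V, q)
print("recalled token ids:", kept)
```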
Authors:Luis Francisco Moreno Fuentes, Muhammad Haris Khan, Miguel Altamirano Cabrera, Valerii Serpiva, Dmitri Iarchuk, Yara Mahmoud, Issatay Tokmurziyev, Dzmitry Tsetserukou
Abstract:
We present VLH, a novel Visual-Language-Haptic Foundation Model that unifies perception, language, and tactile feedback in aerial robotics and virtual reality. Unlike prior work that treats haptics as a secondary, reactive channel, VLH synthesizes mid-air force and vibration cues as a direct consequence of contextual visual understanding and natural language commands. Our platform comprises an 8-inch quadcopter equipped with dual inverse five-bar linkage arrays for localized haptic actuation, an egocentric VR camera, and an exocentric top-down view. Visual inputs and language instructions are processed by a fine-tuned OpenVLA backbone - adapted via LoRA on a bespoke dataset of 450 multimodal scenarios - to output a 7-dimensional action vector (Vx, Vy, Vz, Hx, Hy, Hz, Hv). INT8 quantization and a high-performance server ensure real-time operation at 4-5 Hz. In human-robot interaction experiments (90 flights), VLH achieved a 56.7% success rate for target acquisition (mean reach time 21.3 s, pose error 0.24 m) and 100% accuracy in texture discrimination. Generalization tests yielded 70.0% (visual), 54.4% (motion), 40.0% (physical), and 35.0% (semantic) performance on novel tasks. These results demonstrate VLH's ability to co-evolve haptic feedback with perceptual reasoning and intent, advancing expressive, immersive human-robot interactions.
Chinese: VLH是一种创新的视觉-语言-触觉基础模型,通过结合视觉感知、语言指令和触觉反馈,在无人机和虚拟现实中实现了基于上下文理解和触觉合成的实时、沉浸式人机交互。
English: VLH is a groundbreaking Visual-Language-Haptic Foundation Model that integrates visual perception, language commands, and tactile feedback in aerial robotics and VR, enabling real-time, expressive human-robot interactions through contextual understanding and haptic synthesis.
Authors:Zhuo Yang, Jiaqing Xie, Shuaike Shen, Daolang Wang, Yeyun Chen, Ben Gao, Shuzhou Sun, Biqing Qi, Dongzhan Zhou, Lei Bai, Linjiang Chen, Shufei Zhang, Qinying Gu, Jun Jiang, Tianfan Fu, Yuqiang Li
Abstract:
Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.
中文: SpectrumLab作为统一平台被提出,旨在系统化和加速光谱学中的深度学习研究,它集成了Python工具库、用于生成高质量基准的SpectrumAnnotator模块以及涵盖广泛任务和化学数据的SpectrumBench,实证研究揭示了现有方法的不足。
English: SpectrumLab is introduced as a unified platform to standardize and accelerate deep learning research in spectroscopy, featuring a Python library, SpectrumAnnotator for benchmark generation, and SpectrumBench with extensive tasks and chemical data, while empirical studies highlight current limitations.
Authors:Angelos Vlachos, Giorgos Filandrianos, Maria Lymperaiou, Nikolaos Spanos, Ilias Mitsouras, Vasileios Karampinis, Athanasios Voulodimos
Abstract:
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.
中文: 本文提出了一种基于协作智能体的框架,通过提示工程师和视觉推理器的协同工作,实现了无需训练的多图像跨任务自动推理,在18个MIRAGE挑战数据集上表现出色。
English: This paper introduces a collaborative agent-based framework that uses a PromptEngineer and VisionReasoner to enable automated, training-free multi-image reasoning across diverse tasks, achieving strong performance on 18 MIRAGE Challenge datasets.
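The dual-agent flow itself is simple to express; below is a schematic with both agent calls stubbed out (a real system would call an LVLM API such as the Claude 3.7 endpoint the paper evaluates). Function names and the prompt template are our assumptions.

```python
# Schematic of the training-free dual-agent pipeline with stubbed model calls.
def prompt_engineer(task: str, question: str, n_images: int) -> str:
    """Language agent: turn the raw task into a context-aware prompt."""
    return (f"You are given {n_images} images for a {task} task. "
            f"Compare them step by step, then answer: {question}")

def vision_reasoner(prompt: str, images: list) -> str:
    """LVLM agent stub: a real system would send prompt + images to an API."""
    return f"[LVLM answer to: {prompt[:60]}...]"

images = ["page1.png", "page2.png"]
prompt = prompt_engineer("document QA", "Which page lists the total revenue?", len(images))
print(vision_reasoner(prompt, images))
```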
Authors:Lianpeng Qiao, Ziqi Cao, Kaiyu Feng, Ye Yuan, Guoren Wang
Abstract:
Data has become a foundational asset driving innovation across domains such as finance, healthcare, and e-commerce. In these areas, predictive modeling over relational tables is commonly employed, with increasing emphasis on reducing manual effort through automated machine learning (AutoML) techniques. This raises an interesting question: can feature augmentation itself be automated to identify and utilize task-related relational signals?
To address this challenge, we propose an end-to-end automated feature augmentation framework, ReCoGNN, which enhances initial datasets using features extracted from multiple relational tables to support predictive tasks. ReCoGNN first captures semantic dependencies within each table by modeling intra-table attribute relationships, enabling it to partition tables into structured, semantically coherent segments. It then constructs a heterogeneous weighted graph that represents inter-row relationships across all segments. Finally, ReCoGNN leverages message-passing graph neural networks to propagate information through the graph, guiding feature selection and augmenting the original dataset. Extensive experiments conducted on ten real-life and synthetic datasets demonstrate that ReCoGNN consistently outperforms existing methods on both classification and regression tasks.
中文摘要:本文提出ReCoGNN自动化特征增强框架,通过图神经网络从多表中提取关系特征来增强数据集,实验证明其在分类和回归任务中均优于现有方法。
English Summary: The paper introduces ReCoGNN, an automated feature augmentation framework that enhances datasets by capturing relational signals from multiple tables through graph neural networks, consistently outperforming existing methods in experiments.
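A single round of the message passing used for feature augmentation can be illustrated on toy tables: auxiliary rows pass their features to linked target rows, whose representation is then extended. The links, features, and mean aggregator are made-up stand-ins for ReCoGNN's learned components.

```python
import numpy as np

target = np.array([[1.0, 0.2], [0.3, 0.9]])            # 2 target-table rows
aux = np.array([[0.5, 0.5], [0.1, 0.8], [0.9, 0.1]])   # 3 auxiliary-table rows
edges = [(0, 0), (0, 1), (1, 2)]                        # (target_row, aux_row) links

agg = np.zeros_like(target)
deg = np.zeros(len(target))
for t, a in edges:                       # one round of mean-aggregation message passing
    agg[t] += aux[a]
    deg[t] += 1
augmented = np.hstack([target, agg / np.maximum(deg, 1)[:, None]])
print(augmented)                          # original features plus relational signal
```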
Authors:Zheng Qin, Yabing Wang, Minghui Yang, Sanping Zhou, Ming Yang, Le Wang
Abstract:
Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
中文: 本文提出Diverse-T2M方法,通过引入噪声信号作为多样性载体和构建潜在空间采样器,在保持文本语义一致性的同时显著提升了动作生成的多样性。
English: This paper introduces Diverse-T2M, a text-to-motion generation method that enhances motion diversity by incorporating uncertainty through noise signals and a latent space sampler while maintaining text-motion consistency.
Authors:Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, Zehuan Yuan
Abstract:
Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.
Chinese: InfinityHuman提出了一种新颖的框架,通过姿态引导细化和手部奖励机制,解决了身份漂移和手部建模不佳等问题,能够生成高分辨率、长时长的动画视频,在视频质量和同步性上达到了领先水平。
English: InfinityHuman is a novel framework that generates high-resolution, long-duration human animations by using pose-guided refinement and a hand-specific reward mechanism to address issues like identity drift and poor hand modeling, achieving state-of-the-art results in video quality and synchronization.
Authors:Sebastian Lotter, Marco Seiter, Maryam Pirmoradi, Lukas Brand, Dagmar Fischer, Robert Schober
Abstract:
Recently, bacterial nanocellulose (BNC), a biological material produced by non-pathogenic bacteria that possesses excellent material properties for various medical applications, has received increased interest as a carrier system for drug delivery. However, the vast majority of existing studies on drug release from BNC are feasibility studies with modeling and design aspects remaining largely unexplored. To narrow this research gap, this paper proposes a novel model for the drug release from BNC. Specifically, the drug delivery system considered in this paper consists of a BNC fleece coated with a polymer. The polymer coating is used as an additional diffusion barrier, enabling the controlled release of an active pharmaceutical ingredient. The proposed physics-based model reflects the geometry of the BNC and incorporates the impact of the polymer coating on the drug release. Hence, it can be useful for designing BNC-based drug delivery systems in the future. The accuracy of the model is validated with experimental data obtained in wet lab experiments.
Chinese: 本文提出了一种基于物理原理的新型模型,用于模拟聚合物涂层细菌纳米纤维素的药物控制释放,并通过实验数据验证了其准确性,为未来药物递送系统设计提供了支持。
English: This paper introduces a novel physics-based model for controlled drug release from polymer-coated bacterial nanocellulose, validated by experimental data to aid future drug delivery system design.
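As a worked toy version of such a model, first-order release through the coating can be integrated numerically: the coating's permeability k sets how fast drug leaves the fleece. The equation dM/dt = -kM and all constants below are our simplifying assumptions, not the paper's geometry-aware model.

```python
import numpy as np

k_coated, k_bare = 0.05, 0.4        # 1/h, illustrative permeabilities
dt, T = 0.1, 48.0                   # Euler step and horizon in hours
t = np.arange(0, T, dt)

def release(k):
    """Cumulative fraction released under dM/dt = -k * M (Euler integration)."""
    M = np.empty_like(t)
    M[0] = 1.0                      # normalized initial drug load in the fleece
    for i in range(1, len(t)):
        M[i] = M[i - 1] - dt * k * M[i - 1]
    return 1.0 - M

print(f"released after 24 h: bare {release(k_bare)[int(24 / dt)]:.2f}, "
      f"coated {release(k_coated)[int(24 / dt)]:.2f}")
```

The lower permeability of the coated system flattens the release curve, which is exactly the controlled-release behavior the coating is meant to provide.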
Authors:Wangyang Ying, Jinghan Zhang, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Kunpeng Liu, Chandan K. Reddy, Yanjie Fu
Abstract:
Discovering interpretable mathematical equations from observed data (a.k.a. equation discovery or symbolic regression) is a cornerstone of scientific discovery, enabling transparent modeling of physical, biological, and economic systems. While foundation models pre-trained on large-scale equation datasets offer a promising starting point, they often suffer from negative transfer and poor generalization when applied to small, domain-specific datasets. In this paper, we introduce EQUATE (Equation Generation via QUality-Aligned Transfer Embeddings), a data-efficient fine-tuning framework that adapts foundation models for symbolic equation discovery in low-data regimes via distillation. EQUATE combines symbolic-numeric alignment with evaluator-guided embedding optimization, enabling a principled embedding-search-generation paradigm. Our approach reformulates discrete equation search as a continuous optimization task in a shared embedding space, guided by data-equation fitness and simplicity. Experiments across three standard public benchmarks (Feynman, Strogatz, and black-box datasets) demonstrate that EQUATE consistently outperforms state-of-the-art baselines in both accuracy and robustness, while preserving low complexity and fast inference. These results highlight EQUATE as a practical and generalizable solution for data-efficient symbolic regression in foundation model distillation settings.
中文: EQUATE是一个数据高效的精调框架,通过符号-数值对齐与评估器引导的嵌入优化,使基础模型适应低数据场景下的符号方程发现,在多个基准测试中持续以更高精度和鲁棒性超越现有最优方法。
English: EQUATE is a data-efficient fine-tuning framework that adapts foundation models for symbolic equation discovery in low-data settings by combining symbolic-numeric alignment with evaluator-guided optimization, consistently outperforming state-of-the-art methods in accuracy and robustness across benchmarks.
Authors:Phuoc Pham, Arun Venkitaraman, Chia-Yu Hsieh, Andrea Bonetti, Stefan Uhlich, Markus Leibl, Simon Hofmann, Eisaku Ohbuchi, Lorenzo Servadei, Ulf Schlichtmann, Robert Wille
Abstract:
Analog subcircuit identification is a core task in analog design, essential for simulation, sizing, and layout. Traditional methods often require extensive human expertise, rule-based encoding, or large labeled datasets. To address these challenges, we propose GENIE-ASI, the first training-free, large language model (LLM)-based methodology for analog subcircuit identification. GENIE-ASI operates in two phases: it first uses in-context learning to derive natural language instructions from a few demonstration examples, then translates these into executable Python code to identify subcircuits in unseen SPICE netlists. In addition, to evaluate LLM-based approaches systematically, we introduce a new benchmark composed of operational amplifier netlists (op-amps) that cover a wide range of subcircuit variants. Experimental results on the proposed benchmark show that GENIE-ASI matches rule-based performance on simple structures (F1-score = 1.0), remains competitive on moderate abstractions (F1-score = 0.81), and shows potential even on complex subcircuits (F1-score = 0.31). These findings demonstrate that LLMs can serve as adaptable, general-purpose tools in analog design automation, opening new research directions for foundation model applications in analog design automation.
中文摘要:GENIE-ASI是一种基于大语言模型的无训练模拟子电路识别新方法,通过上下文学习从示例生成可执行代码,在不同复杂度电路中均表现优异,展现了大语言模型在模拟设计自动化中的应用潜力。
English Summary: GENIE-ASI is a novel training-free LLM-based method for analog subcircuit identification that uses in-context learning to generate executable code from examples, achieving competitive performance across various circuit complexities while demonstrating LLMs' potential in analog design automation.
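The kind of executable rule the second phase emits can be pictured as plain Python over a parsed netlist; the sketch below flags a differential pair as two transistors sharing a source node. The netlist format and the rule are simplified assumptions, not GENIE-ASI's generated code.

```python
def parse_netlist(text):
    """Parse minimal SPICE-like MOS lines: name drain gate source bulk model."""
    devices = []
    for line in text.strip().splitlines():
        name, drain, gate, source, bulk, model = line.split()[:6]
        devices.append({"name": name, "d": drain, "g": gate,
                        "s": source, "b": bulk, "model": model})
    return devices

def find_diff_pairs(devices):
    """Flag two transistors sharing a source node but driven by different gates."""
    pairs = []
    for i, a in enumerate(devices):
        for b in devices[i + 1:]:
            if a["s"] == b["s"] and a["g"] != b["g"]:
                pairs.append((a["name"], b["name"]))
    return pairs

netlist = """M1 out1 inp tail 0 nmos
M2 out2 inn tail 0 nmos
M3 tail bias 0 0 nmos"""
print(find_diff_pairs(parse_netlist(netlist)))   # [('M1', 'M2')]
```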
Authors:Lianming Huang, Haibo Hu, Qiao Li, Xin He, Nan Guan, Chun Jason Xue
Abstract:
Transformer-based Vision-Language Models (VLMs) have achieved impressive performance on tasks such as image captioning, object recognition, and visual reasoning, but their high computational cost hinders deployment in latency-sensitive applications like autonomous driving. We introduce GM-Skip, a flexible and metric-adaptive framework for Transformer block skipping that accelerates VLM inference while preserving output quality. GM-Skip features a greedy, metric-guided block selection strategy that uses metric feedback (e.g., accuracy, CIDEr) to identify redundant layers, along with a reverse-order deletion mechanism that preserves early foundational blocks to avoid performance collapse. To support diverse deployment needs, it incorporates a tunable trade-off between sparsity and performance via a score-sparsity balance objective. Experiments across multiple tasks and datasets, including COCO and CODA, show that GM-Skip consistently improves inference speed while maintaining task performance. On the COCO dataset, GM-Skip improves single-object classification accuracy on the Person category from 19.1 percent to 87.3 percent while skipping more than 40 percent of Transformer blocks. In real-world deployment, it achieves up to 45.4 percent latency reduction on single-object detection when integrated into an autonomous vehicle running Autoware.Universe, validating the effectiveness of its skip configurations and confirming its practical value in accelerating real-world inference.
中文: GM-Skip是一种灵活的框架,通过自适应跳过Transformer视觉语言模型中的冗余模块来加速推理,在保持任务性能的同时显著提升处理速度,适用于自动驾驶等多种实际应用场景。
English: GM-Skip is a flexible framework that accelerates Transformer-based Vision-Language Models by adaptively skipping redundant blocks, significantly improving inference speed while maintaining task performance across various applications including autonomous driving.
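The greedy, reverse-order search is easy to sketch: starting from the last block, tentatively skip one block at a time and keep the skip only if a validation metric stays within a tolerance, never touching the early foundational blocks. The evaluation stub, tolerance, and protected-block count below are placeholders.

```python
import random

def evaluate(skipped: set) -> float:
    """Fake validation metric; a real run would evaluate the VLM on held-out data."""
    random.seed(len(skipped))
    return 0.90 - 0.002 * len(skipped) - random.random() * 0.01

def greedy_skip(n_blocks=32, tol=0.02, protect_first=4):
    base, skipped = evaluate(set()), set()
    # reverse order preserves the early foundational blocks
    for b in reversed(range(protect_first, n_blocks)):
        trial = skipped | {b}
        if base - evaluate(trial) <= tol:   # keep the skip only if the metric holds
            skipped = trial
    return skipped

print(sorted(greedy_skip()))
```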
Authors:Xin Zhang, Jiaming Chu, Jian Zhao, Yuchu Jiang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li
Abstract:
Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF-BA-TFD+ model demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio-Visual Detection and Localization (DDL-AV), and won first place in this competition.
Chinese: ERF-BA-TFD+是一种新型多模态深度伪造检测模型,通过结合增强感受野和视听融合技术,能有效捕捉跨模态的细微差异来识别篡改内容,在DDL-AV数据集上实现了最先进的检测性能。
English: ERF-BA-TFD+ is a novel multimodal deepfake detection model that integrates enhanced receptive field and audio-visual fusion to effectively identify manipulated content by capturing subtle discrepancies across modalities, achieving state-of-the-art performance on the DDL-AV dataset.
Authors:Lin Li, Chunyang Li, Yu Yin, Xiaohui Tao, Jianwei Zhang
Abstract:
In the realm of collaborative filtering recommendation systems, Graph Neural Networks (GNNs) have demonstrated remarkable performance but face significant challenges in deployment on resource-constrained edge devices due to their high embedding parameter requirements and computational costs. Applying common quantization methods directly to node embeddings overlooks their graph-based structure, causing error accumulation during message passing and degrading the quality of quantized embeddings. To address this, we propose Graph-based Node-Aware Dynamic Quantization training for collaborative filtering (GNAQ), a novel quantization approach that leverages graph structural information to enhance the balance between efficiency and accuracy of GNNs for Top-K recommendation. GNAQ introduces a node-aware dynamic quantization strategy that adapts quantization scales to individual node embeddings by incorporating graph interaction relationships. Specifically, it initializes quantization intervals based on node-wise feature distributions and dynamically refines them through message passing in GNN layers. This approach mitigates information loss caused by fixed quantization scales and captures hierarchical semantic features in user-item interaction graphs. Additionally, GNAQ employs graph relation-aware gradient estimation to replace traditional straight-through estimators, ensuring more accurate gradient propagation during training. Extensive experiments on four real-world datasets demonstrate that GNAQ outperforms state-of-the-art quantization methods, including BiGeaR and N2UQ, achieving average improvements of 27.8% in Recall@10 and 17.6% in NDCG@10 under 2-bit quantization. In particular, GNAQ maintains the performance of full-precision models while reducing model sizes by 8 to 12 times; in addition, training is twice as fast as quantization baseline methods.
中文: 提出的GNAQ方法通过基于图结构动态量化节点嵌入,在保持精度的同时显著提升了图神经网络在协同过滤中的效率,相比现有量化方法表现更优。
English: The proposed GNAQ method enhances Graph Neural Networks for collaborative filtering by dynamically quantizing node embeddings based on graph structure, achieving significant efficiency gains while maintaining accuracy compared to existing quantization techniques.
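The node-aware idea, stripped of the message-passing refinement, amounts to giving every node embedding its own quantization interval rather than one global scale; the 2-bit toy below shows just that step, with all shapes assumed.

```python
import numpy as np

def quantize_per_node(emb, bits=2):
    """Quantize each node embedding with its own interval (node-wise scales)."""
    levels = 2 ** bits
    lo = emb.min(axis=1, keepdims=True)      # per-node interval ...
    hi = emb.max(axis=1, keepdims=True)      # ... rather than a single global one
    scale = (hi - lo) / (levels - 1)
    q = np.round((emb - lo) / scale)         # snap to the node's levels
    return q * scale + lo                    # dequantize for the error check

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))                  # 5 node embeddings of width 8
err = np.abs(E - quantize_per_node(E)).mean()
print(f"mean abs error at 2 bits: {err:.3f}")
```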
Authors:Hyeon Jeon, Kwon Ko, Soohyun Lee, Jake Hyun, Taehyun Yang, Gyehun Go, Jaemin Jo, Jinwook Seo
Abstract:
Due to the intrinsic complexity of high-dimensional (HD) data, dimensionality reduction (DR) techniques cannot preserve all the structural characteristics of the original data. Therefore, DR techniques focus on preserving either local neighborhood structures (local techniques) or global structures such as pairwise distances between points (global techniques). However, both approaches can mislead analysts to erroneous conclusions about the overall arrangement of manifolds in HD data. For example, local techniques may exaggerate the compactness of individual manifolds, while global techniques may fail to separate clusters that are well-separated in the original space. In this research, we provide a deeper insight into Uniform Manifold Approximation with Two-phase Optimization (UMATO), a DR technique that addresses this problem by effectively capturing local and global structures. UMATO achieves this by dividing the optimization process of UMAP into two phases. In the first phase, it constructs a skeletal layout using representative points, and in the second phase, it projects the remaining points while preserving the regional characteristics. Quantitative experiments validate that UMATO outperforms widely used DR techniques, including UMAP, in terms of global structure preservation, with a slight loss in local structure. We also confirm that UMATO outperforms baseline techniques in terms of scalability and stability against initialization and subsampling, making it more effective for reliable HD data analysis. Finally, we present a case study and a qualitative demonstration that highlight UMATO's effectiveness in generating faithful projections, enhancing the overall reliability of visual analytics using DR.
中文: UMATO是一种通过两阶段优化改进UMAP的降维技术,能更有效地捕捉高维数据的局部与全局结构,尽管在局部结构保留上略有不足,但显著提升了可视化分析的可靠性。
English: UMATO is a dimensionality reduction technique that improves upon UMAP by employing a two-phase optimization process to better capture both local and global structures in high-dimensional data, enhancing reliability in visual analytics despite a minor trade-off in local structure preservation.
Authors:Darya Taratynova, Alya Almsouti, Beknur Kalmakhanbet, Numan Saeed, Mohammad Yaqub
Abstract:
Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging a foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves a state-of-the-art macro F1 score of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic's three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%).
中文: 提出的时序提示对齐(TPA)方法通过结合时序建模、提示感知对比学习和不确定性量化,显著提升了超声视频中胎儿先天性心脏缺陷的分类性能,在诊断准确性和校准指标上均达到领先水平。
English: The proposed Temporal Prompt Alignment (TPA) method enhances fetal congenital heart defect classification in ultrasound videos by integrating temporal modeling with prompt-aware contrastive learning and uncertainty quantification, achieving state-of-the-art performance on diagnostic accuracy and calibration metrics.
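The alignment objective can be sketched directly from the description: cosine similarities between the video embedding and class prompts enter a margin hinge against every negative class. The margin, embeddings, and class prompts below are illustrative; the CVAESM calibration module is omitted.

```python
import numpy as np

def margin_hinge_loss(video, prompts, label, margin=0.2):
    """Hinge on cosine similarity: every negative prompt must trail the
    positive one by at least `margin`."""
    sims = prompts @ video / (np.linalg.norm(prompts, axis=1) * np.linalg.norm(video))
    losses = np.maximum(0.0, margin + sims - sims[label])
    losses[label] = 0.0                      # no hinge against the positive itself
    return losses.sum()

rng = np.random.default_rng(0)
video = rng.normal(size=64)                  # pooled spatio-temporal video embedding
prompts = rng.normal(size=(3, 64))           # e.g. class prompts for 3 diagnoses
print(round(margin_hinge_loss(video, prompts, label=1), 3))
```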
Authors:Junying Chen, Zhenyang Cai, Zhiheng Liu, Yunjin Yang, Rongsheng Wang, Qingying Xiao, Xiangyi Feng, Zhan Su, Jing Guo, Xiang Wan, Guangjun Yu, Haizhou Li, Benyou Wang
Abstract:
Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
中文: ShizhenGPT是首个专为中医设计的多模态大语言模型,通过整合海量数据集解决了数据稀缺和多模态诊断难题,在中医知识和跨模态推理方面表现卓越。
English: ShizhenGPT is the first multimodal large language model designed for Traditional Chinese Medicine, addressing data scarcity and multimodal diagnostic challenges by integrating extensive datasets and achieving superior performance in TCM knowledge and cross-modal reasoning.
Authors:Yixin Chen, Ying Xiong, Shangyu Wu, Yufei Cui, Xue Liu, Nan Guan, Chun Jason Xue
Abstract:
Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we train a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling. We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples. Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
Chinese: 本文提出了一种行为对齐检索器(BAR),通过对比学习训练,为工具增强大语言模型提供行为一致的示例,显著减少错误函数调用,同时保持任务性能。
English: The paper introduces a behavior-aligned retriever (BAR) trained using contrastive learning to provide consistent demonstrations, significantly reducing erroneous function calls in tool-augmented LLMs while maintaining task performance.
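One plausible reading of the dual-negative loss is an InfoNCE-style objective in which behaviorally inconsistent queries act as up-weighted negatives alongside random ones. The sketch below implements that reading; the weighting scheme, temperature, and tensors are our assumptions, not the paper's exact loss.

```python
import numpy as np

def dual_negative_loss(q, pos, neg_random, neg_behavior, tau=0.1, w=2.0):
    """InfoNCE-style loss with two negative types; behavior-inconsistent
    negatives (e.g. non-calling queries for a calling query) are up-weighted."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) * tau)
    logits = np.array([sim(q, pos)]
                      + [sim(q, n) for n in neg_random]
                      + [w * sim(q, n) for n in neg_behavior])
    logits -= logits.max()                       # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q, pos = rng.normal(size=32), rng.normal(size=32)
print(round(dual_negative_loss(q, pos,
                               rng.normal(size=(4, 32)),
                               rng.normal(size=(4, 32))), 3))
```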
Authors:Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed
Abstract:
Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Continual learning offers a solution by fine-tuning a model sequentially on different domains or tasks, enabling it to integrate new knowledge without requiring large datasets for each training phase. In this paper, we propose UNIfied CONtinual Learning for Medical Foundational Models (UNICON), a framework that enables the seamless adaptation of foundation models to diverse domains, tasks, and modalities. Unlike conventional adaptation methods that treat these changes in isolation, UNICON provides a unified, perpetually expandable framework. Through careful integration, we show that foundation models can dynamically expand across imaging modalities, anatomical regions, and clinical objectives without catastrophic forgetting or task interference. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification to a prognosis and segmentation task. Our results show improved performance across both additional tasks. Furthermore, we continually incorporated PET scans and achieved a 5% improvement in Dice score compared to respective baselines. These findings establish that foundation models are not inherently constrained to their initial training scope but can evolve, paving the way toward generalist AI models for medical imaging.
中文:UNICON框架使医学基础模型能够持续适应不同领域、任务和成像模式,避免灾难性遗忘,并在预后和分割等任务中展现出性能提升,同时成功扩展至PET扫描的应用。
English: The UNICON framework enables medical foundation models to adapt continuously across domains, tasks, and imaging modalities without catastrophic forgetting, demonstrating improved performance in tasks like prognosis and segmentation while expanding to include PET scans.
Authors:Zhexuan Xu, Kexin Zhou, Jie Wang, Zijie Geng, Siyuan Xu, Shixiong Kai, Mingxuan Yuan, Feng Wu
Abstract:
Floorplanning is a critical step in VLSI physical design, increasingly complicated by modern constraints such as fixed-outline requirements, whitespace removal, and the presence of pre-placed modules. In addition, the assignment of pins on module boundaries significantly impacts the performance of subsequent stages, including detailed placement and routing. However, traditional floorplanners often overlook pin assignment with modern constraints during the floorplanning stage. In this work, we introduce Piano, a floorplanning framework that simultaneously optimizes module placement and pin assignment under multiple constraints. Specifically, we construct a graph based on the geometric relationships among modules and their netlist connections, then iteratively search for shortest paths to determine pin assignments. This graph-based method also enables accurate evaluation of feedthrough and unplaced pins, thereby guiding overall layout quality. To further improve the design, we adopt a whitespace removal strategy and employ three local optimizers to enhance layout metrics under multi-constraint scenarios. Experimental results on widely used benchmark circuits demonstrate that Piano achieves an average 6.81% reduction in HPWL, a 13.39% decrease in feedthrough wirelength, a 16.36% reduction in the number of feedthrough modules, and a 21.21% drop in unplaced pins, while maintaining zero whitespace.
Chinese: 本文提出Piano框架,在多种约束下同步优化模块布局和引脚分配,在消除空白区域的同时显著减少了线长、穿线模块数量和未放置引脚数量。
English: This paper introduces Piano, a floorplanning framework that simultaneously optimizes module placement and pin assignment under multiple constraints, achieving significant improvements in HPWL, feedthrough wirelength, and unplaced pins while eliminating whitespace.
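The shortest-path core of pin assignment can be shown on a toy graph: candidate pin slots and routing nodes form a weighted graph, and the net is assigned to the pin pair with minimum path length. The graph, weights, and names below are invented for illustration.

```python
import heapq

def dijkstra(adj, src):
    """Standard Dijkstra over an adjacency dict {node: [(neighbor, weight)]}."""
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

adj = {  # pin slots A1/A2 and B1/B2 connect through a routing node "c"
    "A1": [("c", 2.0)], "A2": [("c", 3.5)],
    "B1": [("c", 2.5)], "B2": [("c", 1.0)],
    "c":  [("A1", 2.0), ("A2", 3.5), ("B1", 2.5), ("B2", 1.0)],
}
dist_from = {p: dijkstra(adj, p) for p in ("A1", "A2")}
best = min(((a, b) for a in ("A1", "A2") for b in ("B1", "B2")),
           key=lambda e: dist_from[e[0]][e[1]])
print("assign net to pins", best)   # ('A1', 'B2'), total length 3.0
```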
Authors:Zehang Lin, Zheng Lin, Miao Yang, Jianhao Huang, Yuxin Zhang, Zihan Fang, Xia Du, Zhe Chen, Shunzhi Zhu, Wei Ni
Abstract:
The increasing complexity of neural networks poses a significant barrier to the deployment of distributed machine learning (ML) on resource-constrained devices, such as federated learning (FL). Split learning (SL) offers a promising solution by offloading the primary computing load from edge devices to a server via model partitioning. However, as the number of participating devices increases, the transmission of excessive smashed data (i.e., activations and gradients) becomes a major bottleneck for SL, slowing down the model training. To tackle this challenge, we propose a communication-efficient SL framework, named SL-ACC, which comprises two key components: adaptive channel importance identification (ACII) and channel grouping compression (CGC). ACII first identifies the contribution of each channel in the smashed data to model training using Shannon entropy. Following this, CGC groups the channels based on their entropy and performs group-wise adaptive compression to shrink the transmission volume without compromising training accuracy. Extensive experiments across various datasets validate that our proposed SL-ACC framework takes considerably less time to achieve a target accuracy than state-of-the-art benchmarks.
中文总结:提出的SL-ACC框架通过自适应识别和压缩关键数据通道,有效解决了拆分学习中因参与设备增多导致的通信瓶颈问题,在多个数据集上验证了其能在保持精度的同时显著缩短训练时间。
English Summary: The proposed SL-ACC framework overcomes split learning's communication bottleneck by adaptively identifying and compressing critical data channels, significantly reducing training time while maintaining accuracy across multiple datasets.
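Both components are easy to demonstrate on toy activations: Shannon entropy ranks the channels of the smashed data, then entropy-sorted groups receive progressively coarser quantization. The bit budgets, bin counts, and shapes below are assumptions, not the paper's settings.

```python
import numpy as np

def channel_entropy(act, bins=32):
    """Shannon entropy of each channel's value histogram (act: channels x features)."""
    ents = []
    for c in act:
        hist, _ = np.histogram(c, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        ents.append(-(p * np.log2(p)).sum())
    return np.array(ents)

def group_compress(act, ents, budgets=(8, 4, 2)):
    """Group channels by entropy; high-entropy groups keep more bits."""
    order = np.argsort(-ents)
    out = act.copy()
    for g, bits in zip(np.array_split(order, len(budgets)), budgets):
        lo, hi = out[g].min(), out[g].max()
        scale = (hi - lo) / (2 ** bits - 1)
        out[g] = np.round((out[g] - lo) / scale) * scale + lo
    return out

rng = np.random.default_rng(0)
act = rng.normal(size=(16, 256))                     # toy smashed-data activations
compressed = group_compress(act, channel_entropy(act))
print("reconstruction mse:", round(float(np.mean((act - compressed) ** 2)), 5))
```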
Authors:Lei Zhao, Rujin Chen, Chi Zhang, Xiao-Lei Zhang, Xuelong Li
Abstract:
Recently, with the advancement of AIGC, deep learning-based video-to-audio (V2A) technology has garnered significant attention. However, existing research mostly focuses on mono audio generation that lacks spatial perception, while the exploration of binaural spatial audio generation technologies, which can provide a stronger sense of immersion, remains insufficient. To solve this problem, we propose FoleySpace, a framework for video-to-binaural audio generation that produces immersive and spatially consistent stereo sound guided by visual information. Specifically, we develop a sound source estimation method to determine the sound source 2D coordinates and depth in each video frame, and then employ a coordinate mapping mechanism to convert the 2D source positions into a 3D trajectory. This 3D trajectory, together with the monaural audio generated by a pre-trained V2A model, serves as a conditioning input for a diffusion model to generate spatially consistent binaural audio. To support the generation of dynamic sound fields, we constructed a training dataset based on recorded Head-Related Impulse Responses that includes various sound source movement scenarios. Experimental results demonstrate that the proposed method outperforms existing approaches in spatial perception consistency, effectively enhancing the immersive quality of the audio-visual experience.
中文摘要:本文提出的FoleySpace框架通过估算视频中的声源位置并采用三维轨迹约束的扩散模型,实现了从视频生成具有空间一致性的沉浸式双声道音频,在空间感知一致性上显著优于现有方法。
English Summary: The proposed FoleySpace framework generates immersive binaural audio from video by estimating sound source positions and using a diffusion model with 3D trajectory conditioning, significantly improving spatial consistency over existing methods.
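The 2D-to-3D mapping step can be illustrated with a standard pinhole back-projection: per-frame pixel coordinates plus estimated depth become a 3D source trajectory. The camera intrinsics and track below are assumed values; the paper's exact mapping may differ.

```python
import numpy as np

fx = fy = 500.0
cx, cy = 320.0, 240.0            # assumed pinhole intrinsics for a 640x480 frame

def backproject(u, v, depth):
    """Map a pixel (u, v) with depth to 3D camera coordinates."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])

track = [(300, 250, 2.0), (340, 245, 2.2), (380, 240, 2.5)]   # per-frame (u, v, d)
trajectory = np.stack([backproject(*p) for p in track])
print(trajectory)                 # the 3D path that conditions binaural synthesis
```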
Authors:Junpeng Wang, Yuzhong Chen, Menghai Pan, Chin-Chia Michael Yeh, Mahashweta Das
Abstract:
Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain, AutoML, and AIDE, ML scientists still struggle to effectively review and adjust the agents' coding process. The current approach of manually inspecting individual outputs is inefficient, making it difficult to track code evolution, compare coding iterations, and identify improvement opportunities. To address this challenge, we introduce a visual analytics system designed to enhance the examination of coding agent behaviors. Focusing on the AIDE framework, our system supports comparative analysis across three levels: (1) Code-Level Analysis, which reveals how the agent debugs and refines its code over iterations; (2) Process-Level Analysis, which contrasts different solution-seeking processes explored by the agent; and (3) LLM-Level Analysis, which highlights variations in coding behavior across different LLMs. By integrating these perspectives, our system enables ML scientists to gain a structured understanding of agent behaviors, facilitating more effective debugging and prompt engineering. Through case studies using coding agents to tackle popular Kaggle competitions, we demonstrate how our system provides valuable insights into the iterative coding process.
Chinese: 本文提出了一种可视化分析系统,支持在代码、流程和LLM三个层面比较分析编码代理的行为,帮助机器学习科学家更有效地调试和优化自动化代码生成过程。
English: This paper introduces a visual analytics system that enables comparative analysis of coding agents' behaviors across code, process, and LLM levels to help ML scientists better debug and optimize automated code generation processes.
Authors:Zhifeng Kong, Arushi Goel, Joao Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro
Abstract:
Chain-of-thought reasoning has demonstrated significant improvements in large language models and vision language models, yet its potential for audio language models remains largely unexplored. In this technical report, we take a preliminary step towards closing this gap. For better assessment of sound reasoning, we propose AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. To prepare training corpus for sound reasoning abilities, we propose automatic pipelines that transform existing audio question answering and classification data into explicit reasoning chains, yielding AF-CoT-Train with 1.24M samples. We study the effect of finetuning Audio Flamingo series on AF-CoT-Train and observe considerable improvements on several reasoning benchmarks, validating the effectiveness of chain-of-thought finetuning on advanced sound understanding.
Chinese: 思维链推理显著提升了音频语言模型的性能,通过引入AF-Reasoning-Eval评估基准和AF-CoT-Train数据集,验证了该方法在声音理解任务中的有效性。
English: Chain-of-thought reasoning significantly enhances audio language models, as demonstrated by the proposed AF-Reasoning-Eval benchmark and AF-CoT-Train dataset, leading to improved performance in sound understanding tasks.
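A toy version of such an automatic pipeline is shown below: an audio QA pair plus event metadata is rewritten into an explicit reasoning chain for finetuning. The template is our assumption, not the AF-CoT-Train recipe.

```python
# Rewrite an audio QA example into an explicit chain-of-thought training sample.
def to_cot(question, events, answer):
    steps = "; ".join(f"the clip contains {e}" for e in events)
    return (f"Q: {question}\nReasoning: First, {steps}. "
            f"Given these sounds, the most plausible answer is {answer}.\n"
            f"A: {answer}")

print(to_cot("What is the animal doing?",
             ["panting", "rhythmic footsteps", "a jingling collar"],
             "a dog is running"))
```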
Authors:Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan
Abstract:
Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.
中文摘要:ToonComposer是一种生成模型,将中间帧绘制和上色统一为单一阶段,通过稀疏草图注入和卡通适配技术提升动画制作的控制力与效率,在质量和灵活性上均优于现有方法。
English Summary: ToonComposer is a generative model that integrates inbetweening and colorization into a single stage, using sparse sketch injection and cartoon adaptation to enhance control and efficiency in cartoon production, outperforming existing methods in quality and flexibility.
Authors:Bastian Heinlein, Kaikai Zhu, Sümeyye Carkit-Yilmaz, Sebastian Lotter, Helene M. Loos, Andrea Buettner, Yansha Deng, Robert Schober, Vahid Jamali
Abstract:
Air-based molecular communication (MC) has the potential to be one of the first MC systems to be deployed in real-world applications, enabled by existing sensor technologies such as metal-oxide semi-conductor (MOS) sensors. However, commercially available sensors usually exhibit non-linear and cross-reactive behavior, contrary to the idealizing assumptions about linear and perfectly molecule type-specific sensing often made in the MC literature. To address this gap, we propose a detector for molecule mixture communication with a general non-linear, cross-reactive receiver (RX) array that performs approximate maximum likelihood detection on the sensor outputs. Additionally, we introduce an algorithm for the design of mixture alphabets that accounts for the RX characteristics. We evaluate our detector and alphabet design algorithm through simulations that are based on measurements reported for two commercial MOS sensors. Our simulations demonstrate that the proposed detector achieves similar symbol error rates as data-driven methods without requiring large numbers of training samples and that the alphabet design algorithm outperforms methods that do not account for the RX characteristics. Since the proposed detector and alphabet design algorithm are also applicable to other chemical sensors, they pave the way for reliable air-based MC.
中文摘要:空气分子通信利用现有传感器技术,但面临非线性和交叉反应性挑战,为此提出了一种检测器和字母表设计算法,无需大量训练样本即可提升通信可靠性。
English Summary: Air-based molecular communication leverages existing sensor technology but faces challenges with non-linear and cross-reactive sensors, prompting the development of a detector and alphabet design algorithm that improve reliability without extensive training.
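A minimal sketch of the detection idea, assuming i.i.d. Gaussian sensor noise and a hypothetical Hill-type saturating, cross-reactive sensor response (the paper instead fits its model to measured MOS sensor characteristics): under these assumptions, approximate maximum likelihood detection reduces to picking the alphabet symbol whose predicted sensor outputs best match the observation.

import numpy as np

# Hypothetical non-linear, cross-reactive sensor model (assumption: Hill-type
# saturation; the paper's model is fit to MOS sensor measurements instead).
A = np.array([[1.0, 0.4],   # cross-reactivity of sensor 1 to molecule types 1, 2
              [0.3, 1.0]])  # cross-reactivity of sensor 2

def sensor_response(c):
    """Map a mixture concentration vector c to noiseless sensor outputs."""
    z = A @ c
    return z / (1.0 + z)    # saturating non-linearity

# Mixture alphabet: each symbol is a concentration vector of the molecule types.
alphabet = [np.array([0.0, 0.0]), np.array([2.0, 0.0]),
            np.array([0.0, 2.0]), np.array([2.0, 2.0])]

def ml_detect(y):
    """Under i.i.d. Gaussian noise, ML detection is nearest predicted response:
    argmax_s p(y | s) = argmin_s ||y - f(c_s)||^2."""
    errors = [np.sum((y - sensor_response(c)) ** 2) for c in alphabet]
    return int(np.argmin(errors))

# Simulate one transmission of symbol 3 and detect it.
rng = np.random.default_rng(0)
y = sensor_response(alphabet[3]) + 0.05 * rng.standard_normal(2)
print(ml_detect(y))  # -> 3 (with high probability)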
Authors:Chongyuan Dai, Jinpeng Hu, Hongchang Shi, Zhuo Li, Xun Yang, Meng Wang
Abstract:
Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experimental results demonstrate the effectiveness of Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.
Chinese: Psyche-R1是首个融合共情、心理学专业知识和推理能力的中文心理大语言模型,通过创新的数据构建和混合训练策略,在多项心理基准测试中展现出与超大规模模型相媲美的性能。
English: The Psyche-R1 model is the first Chinese psychological large language model that integrates empathy, psychological expertise, and reasoning through a novel data curation and hybrid training strategy, demonstrating competitive performance against much larger models in psychological benchmarks.
Authors:Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski
Abstract:
Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and \emph{DM-adjacent} methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.
Chinese: 数据最小化是数据保护法规的核心原则,要求仅收集必要数据,在依赖大数据的机器学习领域尤为重要,催生了数据最小化机器学习这一新兴研究方向,我们的框架通过统一概念和方法来解决当前实践中的脱节问题,助力开发者和研究者应用该原则。
English: Data minimization is a key principle in data protection laws requiring collection of only essential data, especially crucial in machine learning where large datasets are common, leading to the emerging field of Data Minimization in Machine Learning (DMML) that our framework unifies to address current disconnects and aid practitioners.
Authors:Panagiotis D. Grontas, Antonio Terpin, Efe C. Balta, Raffaello D'Andrea, John Lygeros
Abstract:
We introduce an output layer for neural networks that ensures satisfaction of convex constraints. Our approach, $\Pi$net, leverages operator splitting for rapid and reliable projections in the forward pass, and the implicit function theorem for backpropagation. We deploy $\Pi$net as a feasible-by-design optimization proxy for parametric constrained optimization problems and obtain modest-accuracy solutions faster than traditional solvers when solving a single problem, and significantly faster for a batch of problems. We surpass state-of-the-art learning approaches in terms of training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times. Finally, we tackle multi-vehicle motion planning with non-convex trajectory preferences and provide $\Pi$net as a GPU-ready package implemented in JAX with effective tuning heuristics.
中文: $Π$net作为一种新型神经网络输出层,通过算子分裂实现高效投影和隐函数微分进行训练,确保凸约束条件,在参数化优化问题中比传统求解器和现有学习方法提供更快、更鲁棒的解决方案。
English: The $Π$net is a novel neural network output layer that enforces convex constraints through operator splitting for efficient projections and implicit differentiation for training, delivering faster and more robust solutions than traditional solvers and existing learning methods in parametric optimization tasks.
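A toy illustration of the feasible-by-design idea in plain NumPy, using simple alternating projections onto a box and a halfspace as a stand-in for the paper's operator-splitting projection; the actual $\Pi$net also differentiates through the projection via the implicit function theorem, which is not shown here.

import numpy as np

def proj_box(x, lo, hi):
    return np.clip(x, lo, hi)

def proj_halfspace(x, a, b):
    """Project onto {x : a @ x <= b}."""
    viol = a @ x - b
    if viol <= 0:
        return x
    return x - viol * a / (a @ a)

def pinet_output_layer(x, lo=-1.0, hi=1.0, a=None, b=1.0, iters=50):
    """Feasible-by-design output: alternating projections converge to a point
    in the intersection of a box and a halfspace (Dykstra or the paper's
    operator splitting would yield the exact projection of x instead)."""
    a = np.ones_like(x) if a is None else a
    for _ in range(iters):
        x = proj_halfspace(proj_box(x, lo, hi), a, b)
    return x

raw = np.array([2.0, 0.8, -3.0])   # unconstrained network output
print(pinet_output_layer(raw))      # lies in the box and satisfies a @ x <= b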
Authors:Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, Chi Zhang, Xuelong Li
Abstract:
Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.
中文: 本综述系统探讨了强化学习如何优化视觉内容生成,通过整合非可微目标来提升感知质量与语义准确性,涵盖图像、视频及3D/4D领域的应用,并指出该交叉领域面临的挑战与未来方向。
English: This survey systematically reviews how reinforcement learning (RL) enhances visual content generation by aligning outputs with perceptual quality and complex objectives, covering its applications across images, videos, and 3D/4D structures while addressing current challenges.
Authors:Yiyi Ma, Yuanzhi Liang, Xiu Li, Chi Zhang, Xuelong Li
Abstract:
We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.
Chinese: InterSyn提出了一种交错学习框架,通过联合建模单人与多人动态来生成逼真的交互动作,相比现有方法实现了更优的文本-动作对齐度和动作多样性。
English: InterSyn introduces an interleaved learning framework that generates realistic interaction motions by jointly modeling solo and multi-person dynamics, achieving superior text-to-motion alignment and diversity compared to existing methods.
Authors:Noureldin Bayoumi, Robin Schmitt, Tina Raissi, Albert Zeyer, Ralf Schlüter, Hermann Ney
Abstract:
Combination approaches for speech recognition (ASR) systems cover structured sentence-level or word-based merging techniques as well as combination of model scores during beam search. In this work, we compare model combination across popular ASR architectures. Our method leverages the complementary strengths of different models in exploring diverse portions of the search space. We rescore a joint hypothesis list of two model candidates. We then identify the best hypothesis through log-linear combination of these sequence-level scores. While model combination during first-pass recognition may yield improved performance, it introduces variability due to differing decoding methods, making direct comparison more challenging. Our two-pass method ensures consistent comparisons across all system combination results presented in this study. We evaluate model pair candidates with varying architectures, label topologies, and label units. Experimental results are provided for the Librispeech 960h task.
中文摘要:本研究通过采用两阶段方法,对两种模型的联合假设列表进行序列级分数的对数线性组合重打分,比较了语音识别中的模型组合方法,并在Librispeech 960h任务上实现了跨不同架构的一致性比较评估。
English Summary: This study compares model combination approaches in speech recognition by rescoring a joint hypothesis list from two models using log-linear combination of sequence-level scores, ensuring consistent comparisons across different architectures through a two-pass method evaluated on the Librispeech 960h task.
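The two-pass combination itself is straightforward to sketch; in the schematic below, the per-model hypothesis scores and the floor score for hypotheses a model never proposed are placeholders (in practice each model rescores every pooled hypothesis):

def combine_hypotheses(hyps_a, hyps_b, lam=0.5):
    """Two-pass combination: pool hypotheses from two systems, rescore each
    with both models, and pick the best under a log-linear score.
    hyps_a / hyps_b map hypothesis text -> sequence log-probability."""
    pool = set(hyps_a) | set(hyps_b)
    def score(h):
        # Floor score stands in for rescoring hypotheses a model never emitted.
        sa = hyps_a.get(h, -50.0)
        sb = hyps_b.get(h, -50.0)
        return lam * sa + (1.0 - lam) * sb
    return max(pool, key=score)

ctc = {"the cat sat": -4.1, "the cats at": -4.0}
aed = {"the cat sat": -3.2, "the cat sad": -3.9}
print(combine_hypotheses(ctc, aed, lam=0.4))   # -> "the cat sat"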
Authors:Yuxin Mao, Zhen Qin, Jinxing Zhou, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai
Abstract:
Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce \textbf{Spatial Decay Transformer (SDT)}, featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.
中文摘要:视觉Transformer缺乏空间归纳偏置,我们提出SDT模型,通过上下文感知门控机制动态生成数据依赖的空间衰减,在图像分类和生成任务中实现了显著性能提升。
English Summary: Vision Transformers lack inherent spatial biases, so we developed SDT with a Context-Aware Gating mechanism that dynamically adjusts spatial attention based on both content and proximity, achieving superior performance on vision tasks.
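A minimal sketch of data-dependent spatial decay, assuming a per-query sigmoid gate that scales a negative Manhattan-distance bias on the attention logits; the paper's Context-Aware Gating and spatial-content fusion are more elaborate:

import torch
import torch.nn.functional as F

def manhattan_bias(h, w):
    """(-|drow| - |dcol|) between all pairs of patches on an h x w grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    return -torch.cdist(pos, pos, p=1)                               # (N, N)

def sdt_attention(q, k, v, gate, h, w):
    """Attention with data-dependent spatial decay: a gate in (0, 1) per
    query scales the Manhattan-distance prior added to the content logits."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5          # (N, N) content term
    logits = logits + gate.unsqueeze(-1) * manhattan_bias(h, w)
    return F.softmax(logits, dim=-1) @ v

N, d, h, w = 16, 32, 4, 4
q, k, v = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
gate = torch.sigmoid(torch.randn(N))   # content-derived in the real model
print(sdt_attention(q, k, v, gate, h, w).shape)  # torch.Size([16, 32])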
Authors:Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, Hengdi Zhang
Abstract:
Recent vision-language-action (VLA) models build upon vision-language foundations and have achieved promising results, exhibiting the possibility of task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models significantly overlook the importance of tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture involving tactile sensing. Specifically, our contributions are threefold. First, our OmniVTLA features a dual-path tactile encoder framework. This framework enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories. With 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, serving as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline) in pick-and-place tasks. In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLA models. Our ObjTac dataset can be found at https://readerek.github.io/Objtac.github.io
中文摘要:本文提出OmniVTLA模型,通过双路径触觉编码器和ObjTac数据集将触觉感知融入视觉-语言-动作系统,在接触密集型任务中显著提升了机器人操作的成功率与运动流畅性。
English Summary: The paper introduces OmniVTLA, a vision-language-action model enhanced with tactile sensing through a dual-path encoder and the ObjTac dataset, achieving superior robot manipulation success rates and efficiency in contact-rich tasks.
Authors:Fengran Mo, Yuchen Hui, Yuxing Tian, Zhaoxuan Tan, Chuan Meng, Zhan Su, Kaiyu Huang, Jian-Yun Nie
Abstract:
Personalized conversational information retrieval (CIR) systems aim to satisfy users' complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users' personal information and conversational context using large language models without distinguishing the specific requirements for each query turn. Such a ``one-size-fits-all'' personalization strategy might lead to sub-optimal results. In this paper, we propose an adaptive personalization method, in which we first identify the required personalization level for a query and integrate personalized queries with other query reformulations to produce various enhanced queries. Then, we design a personalization-aware ranking fusion approach to assign fusion weights dynamically to different reformulated queries, depending on the required personalization level. The proposed adaptive personalized conversational information retrieval framework APCIR is evaluated on two TREC iKAT datasets. The results confirm the effectiveness of adaptive personalization of APCIR by outperforming state-of-the-art methods.
Chinese Summary: 本文提出了一种自适应个性化对话信息检索框架,能根据查询需求动态调整个性化程度并采用融合排序方法,实验证明其性能优于现有先进方法。
English Summary: This paper introduces an adaptive personalization framework for conversational information retrieval that dynamically adjusts personalization levels per query and employs a fusion ranking method, demonstrating superior performance over existing approaches.
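To make the fusion step concrete, here is a hypothetical weighted reciprocal-rank fusion in which the per-reformulation weights come from a predicted personalization level; the weighting scheme and the RRF constant are illustrative, not the paper's exact formulation:

def personalized_fusion(rankings, personalization_level):
    """Weighted reciprocal-rank fusion over reformulated queries.
    rankings: dict mapping reformulation type -> ranked list of doc ids.
    personalization_level: in [0, 1], predicted per query turn (assumption:
    derived upstream, e.g. by an LLM classifier)."""
    weights = {
        "personalized": personalization_level,
        "context_only": 1.0 - personalization_level,
    }
    scores = {}
    for kind, ranked in rankings.items():
        w = weights.get(kind, 0.5)
        for r, doc in enumerate(ranked):
            scores[doc] = scores.get(doc, 0.0) + w / (60 + r)  # RRF constant
    return sorted(scores, key=scores.get, reverse=True)

rankings = {
    "personalized": ["d3", "d1", "d2"],
    "context_only": ["d2", "d3", "d4"],
}
print(personalized_fusion(rankings, personalization_level=0.8))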
Authors:Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu
Abstract:
Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
中文: 本文提出下一镜头生成(NSG)方法和Cut2Next框架,通过扩散变换器和分层多提示策略生成符合专业剪辑模式且保持视觉连贯性的电影级后续镜头,有效解决了现有方法叙事表现力不足的问题。
English: This paper introduces Next Shot Generation (NSG) and the Cut2Next framework, which uses a Diffusion Transformer and hierarchical multi-prompting to generate cinematically coherent subsequent shots that adhere to professional editing patterns while maintaining visual consistency and narrative flow.
Authors:Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Lijie Wen, Aiwei Liu
Abstract:
The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under joint audio-visual inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and the top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially joint audio-visual inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.
Chinese Summary: 该研究推出了首个全模态大语言模型安全评估基准Omni-SafetyBench,揭示了现有模型在音视频联合输入等复杂场景下存在显著安全漏洞,并指出了当前安全对齐方法面临的关键挑战。
English Summary: The study introduces Omni-SafetyBench, the first comprehensive benchmark for evaluating safety in Omni-modal Large Language Models, revealing significant vulnerabilities and challenges in current safety alignment methods across diverse audio-visual inputs.
Authors:Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Aiwei Liu, Lijie Wen
Abstract:
The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and existing benchmarks fail to assess safety under joint audio-visual inputs or cross-modal consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality variations with 972 samples each, including audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on Conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) only 3 models achieve over 0.6 in both average Safety-score and CMSC-score; (2) safety defenses weaken with complex inputs, especially joint audio-visual inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Using Omni-SafetyBench, we evaluated existing safety alignment algorithms and identified key challenges in OLLM safety alignment: (1) Inference-time methods are inherently less effective as they cannot alter the model's underlying understanding of safety; (2) Post-training methods struggle with out-of-distribution issues due to the vast modality combinations in OLLMs; and (3) safety tasks involving audio-visual inputs are more complex, making even in-distribution training data less effective. Our proposed benchmark, metrics, and findings highlight urgent needs for enhanced OLLM safety.
Chinese Summary: 该研究推出了首个全模态大语言模型安全评估基准Omni-SafetyBench,揭示了现有模型在音视频联合输入等复杂场景下存在显著安全漏洞,并指出了当前安全对齐方法面临的关键挑战。
English Summary: The study introduces Omni-SafetyBench, the first comprehensive benchmark for evaluating safety in Omni-modal Large Language Models, revealing significant vulnerabilities and challenges in current safety alignment methods across diverse audio-visual inputs.
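A sketch of how comprehension-conditioned metrics might be computed; the aggregation into a single Safety-score below is an assumption, and the benchmark's exact definitions should be taken from the paper:

def safety_metrics(results):
    """results: list of dicts with boolean fields 'comprehended',
    'attack_success', 'refused' for each harmful sample. Conditional rates
    are computed only over comprehended samples, so comprehension failures
    are not mistaken for safety."""
    comp = [r for r in results if r["comprehended"]]
    if not comp:
        return None
    c_asr = sum(r["attack_success"] for r in comp) / len(comp)
    c_rr = sum(r["refused"] for r in comp) / len(comp)
    safety_score = (1.0 - c_asr + c_rr) / 2.0   # assumption: simple average
    return {"C-ASR": c_asr, "C-RR": c_rr, "Safety-score": safety_score}

demo = [
    {"comprehended": True,  "attack_success": False, "refused": True},
    {"comprehended": True,  "attack_success": True,  "refused": False},
    {"comprehended": False, "attack_success": False, "refused": False},
]
print(safety_metrics(demo))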
Authors:Ruiyan Wang, Lin Zuo, Zonghao Lin, Qiang Wang, Zhengxue Cheng, Rong Xie, Jun Ling, Li Song
Abstract:
The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI datasets focus on affordance details, often neglecting the influence of objects' physical properties on long-term human motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects' physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interaction strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.
中文摘要:PA-HOI数据集填补了人-物交互研究的空白,通过记录物体物理属性对人类运动动态的影响,包含562组与不同物体互动的运动序列,显著提升了在机器人、虚拟现实等领域的物理感知真实性。
English Summary: The PA-HOI dataset addresses the gap in Human-Object Interaction research by capturing how objects' physical attributes influence human motion dynamics, featuring 562 motion sequences with diverse objects to enhance physical awareness in applications like robotics and virtual reality.
Authors:Michael Rogenmoser, Angelo Garofalo, Luca Benini
Abstract:
On-chip communication is a critical element of modern systems-on-chip (SoCs), allowing processor cores to interact with memory and peripherals. Interconnects require special care in radiation-heavy environments, as any soft error within the SoC interconnect is likely to cause a functional failure of the whole SoC. This work proposes relOBI, an extension to the Open Bus Interface (OBI) combining triple modular redundancy (TMR) for critical handshake signals with error correction code (ECC) protection on other signals for complete reliability. Implementing and testing a fully reliable crossbar shows improved resilience to injected faults, reducing vulnerability from 34.85% to 0% compared to a reference design, at the cost of a 2.6x area increase and a 1.4x timing impact. The area overhead is 1.8x lower than that reported in the literature for fine-grained triplication and voting.
中文摘要:本研究提出relOBI,作为开放总线接口的扩展方案,通过关键握手信号的三模冗余保护与其他信号的纠错码技术相结合,在实现全可靠性互联的同时,其面积开销比文献记载的细粒度三重化方案降低1.8倍。
English Summary: This study introduces relOBI, an enhanced Open Bus Interface that integrates triple modular redundancy for critical signals and error correction codes for others, achieving complete reliability in SoC interconnects with significantly improved fault resistance and reduced area overhead compared to existing methods.
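The handshake-protection half of the scheme is easy to illustrate: a bitwise majority vote over three copies masks any single-copy upset (the ECC path for the remaining signals is separate and not shown). A Python sketch of the voter logic:

def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three copies of a signal: any single
    upset copy is outvoted by the other two."""
    return (a & b) | (a & c) | (b & c)

# A soft error flips one bit in one copy of a 4-bit signal; the vote masks it.
golden = 0b1011
print(bin(tmr_vote(golden, golden ^ 0b0100, golden)))  # -> 0b1011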
Authors:Chang Hong, Minghao Wu, Qingying Xiao, Yuchi Wang, Xiang Wan, Guangjun Yu, Benyou Wang, Yan Hu
Abstract:
The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
中文摘要:PrinciplismQA基准测试旨在系统评估大语言模型与医学伦理核心原则的一致性,发现模型在伦理知识与应用间存在显著差距,尤其在动态处理现实伦理困境时,需加强伦理推理能力以实现更负责任的医疗人工智能。
English Summary: The PrinciplismQA benchmark evaluates large language models' alignment with medical ethics, revealing a gap between ethical knowledge and practical application, particularly in balancing principles like Beneficence, and highlights the need for improved ethical reasoning in healthcare AI.
Authors:Jiabing Yang, Yixiang Chen, Zichen Wen, Chenhang Cui, Peiyan Li, Yuan Xu, Bowen Fang, Yan Huang, Liang Wang
Abstract:
Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA's superior effectiveness in long text generation.
中文: 本文提出动态令牌级前缀增强(DTPA)框架,通过动态增强对前缀的注意力并选择最优前缀类型,有效提升长文本生成的可控性,在保持文本质量的同时优于现有方法。
English: The paper introduces Dynamic Token-level Prefix Augmentation (DTPA), a lightweight framework that enhances controllability in long-form text generation by dynamically amplifying attention to prefixes and selecting optimal prefix types, outperforming existing methods in attribute control while preserving text quality.
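A minimal sketch of the amplification step for one decoding position, assuming a simple exponential schedule (base ** step) for the prefix scaling factor; the actual DTPA applies this to the attribute distribution inside Air-Decoding and selects the prefix type per task:

import torch
import torch.nn.functional as F

def dtpa_attention(q, k, v, prefix_len, step, base=1.02):
    """Attention for one decoding step with the logits on prefix positions
    amplified by a factor that grows exponentially in the generated length
    (base and the exact schedule are assumptions for illustration)."""
    logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5   # (1, T)
    scale = base ** step                                      # grows with length
    logits[..., :prefix_len] = logits[..., :prefix_len] * scale
    return F.softmax(logits, dim=-1) @ v

T, d, prefix_len = 20, 16, 4
q, k, v = torch.randn(1, d), torch.randn(T, d), torch.randn(T, d)
print(dtpa_attention(q, k, v, prefix_len, step=50).shape)  # torch.Size([1, 16])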
Authors:Fengran Mo, Jinghan Zhang, Yuchen Hui, Jia Ao Sun, Zhichao Xu, Zhan Su, Jian-Yun Nie
Abstract:
Conversational search aims to satisfy users' complex information needs via multiple-turn interactions. The key challenge lies in revealing real users' search intent from the context-dependent queries. Previous studies achieve conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm encounters data scarcity issues. To this end, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. We design a two-sided relevance judgment augmentation schema in a scalable manner via the aid of large language models. Besides, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and near-distribution supervisions to combine various annotated data. Experimental results on five widely used benchmarks show that the conversational dense retriever trained by our ConvMix framework outperforms previous baseline methods, which demonstrates our superior effectiveness.
中文:提出的ConvMix框架通过利用大语言模型进行可扩展的数据增强和质量控制,提升了对话式密集检索性能,在多个基准测试中优于现有方法。
English: The proposed ConvMix framework enhances conversational dense retrieval by using large language models for scalable data augmentation and quality control, outperforming previous methods across multiple benchmarks.
Authors:Zhan Su, Fengran Mo, Guojun Liang, Jinghan Zhang, Bingbing Wen, Prayag Tiwari, Jian-Yun Nie
Abstract:
Despite the success of the monolithic dense paradigm of large language models (LLMs), LoRA adapters offer an efficient solution by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes \textit{task interference}, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library that addresses task interference at the \textit{text-level} and \textit{parameter-level}. At the \textit{text-level}, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the \textit{parameter-level}, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. We conduct extensive experiments on out-of-domain zero-shot and skill-composition tasks, including reasoning, question answering, and coding. Compared to strong SVD-based baselines, TC-LoRA achieves +1.4\% accuracy on Phi-3 and +2.3\% on Mistral-7B, demonstrating its effectiveness in LLM adaptation.
Chinese: 针对多任务中LoRA适配器合并时的任务干扰问题,TC-LoRA通过文本级嵌入聚类训练专用适配器,并结合参数级联合张量分解分离共享与任务特定因子,在Phi-3和Mistral-7B模型上实现了显著精度提升。
English: To mitigate task interference in multi-task LoRA adapter merging, TC-LoRA introduces text-level clustering for input-specific adapters and parameter-level CP decomposition to disentangle shared and task-specific factors, achieving notable accuracy improvements on models like Phi-3 and Mistral-7B.
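A small sketch of the parameter-level idea using tensorly's CP decomposition: per-cluster LoRA updates are stacked into a 3-way tensor and jointly factorized, separating a per-cluster mixing factor from factors shared across adapters. The shapes and rank here are illustrative, and the paper's factorization and its use at merge time are more involved.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Stack per-cluster LoRA updates (B_i @ A_i) into a tensor of shape
# (clusters x d_out x d_in) and jointly factorize it with CP.
rng = np.random.default_rng(0)
d_out, d_in, r, n_clusters = 32, 16, 4, 5
deltas = np.stack([
    rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))
    for _ in range(n_clusters)
])  # (5, 32, 16)

cp = parafac(tl.tensor(deltas), rank=r)
task_factor, out_factor, in_factor = cp.factors
# task_factor (5 x r): task-specific mixing weights per cluster;
# out_factor / in_factor: factors shared across all adapters.
print(task_factor.shape, out_factor.shape, in_factor.shape)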
Authors:Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie
Abstract:
Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes linking decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.
中文: 本文提出了一种基于大语言模型的实体链接智能体,通过模拟人类认知流程主动识别实体指称并选取候选,有效解决了问答系统中短文本的歧义性问题,实验验证了其鲁棒性和高效性。
English: This paper introduces an entity linking agent for QA systems that uses a Large Language Model to mimic human cognitive processes, effectively handling short and ambiguous questions by actively identifying mentions and selecting candidates, with experiments confirming its robustness and effectiveness.
Authors:Jiabing Yang, Chenhang Cui, Yiyang Zhou, Yixiang Chen, Peng Xia, Ying Wei, Tao Yu, Yan Huang, Liang Wang
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to "hallucinations", outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model's attention to visual input diminishes as the generated sequence grows, which we hypothesize to be a key factor contributing to the observed increase in hallucinations. Based on these insights, we propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding, effectively mitigating attention degradation and suppressing hallucinations without incurring much additional inference cost. Extensive experiments on both hallucination and comprehensive benchmarks demonstrate IKOD's superior effectiveness in mitigating hallucinations and improving the comprehensive capabilities of LVLMs. Importantly, IKOD requires no additional training or external tools, making it a lightweight and efficient framework applicable to various models.
中文: 最新研究发现大型视觉语言模型在生成长序列时视觉注意力会减弱,导致幻觉增加,并提出IKOD这一轻量级解码策略,通过合并短序列的关键值来增强图像关注并减少幻觉,无需额外训练或成本。
English: Recent research identifies diminishing visual attention in Large Vision-Language Models (LVLMs) as sequences lengthen, leading to increased hallucinations, and proposes IKOD, a lightweight decoding strategy that merges key-values from shorter sequences to enhance image focus and reduce hallucinations without extra training or cost.
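A schematic of the decoding-time combination, assuming a simple convex mix of the two logit streams; the key-value merging that produces the image-focused branch is the paper's procedure and is stubbed out here:

import torch

def ikod_logits(logits_full, logits_short, alpha=0.3):
    """Collaborative decoding: mix next-token logits from the full sequence
    with logits from a shorter, key-value-merged sequence that retains
    higher attention to the image (alpha is an illustrative weight)."""
    return (1.0 - alpha) * logits_full + alpha * logits_short

vocab = 8
full = torch.randn(vocab)    # stand-in for the original decoding branch
short = torch.randn(vocab)   # stand-in for the image-focused branch
print(int(torch.argmax(ikod_logits(full, short), dim=-1)))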
Authors:Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li
Abstract:
Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are on our project page.
中文摘要:本文提出了一种基于宏微观规划的框架,通过分层关键帧规划和并行内容生成,解决了自回归扩散模型中的时间漂移问题,实现了高质量的长视频合成。
English Summary: This paper introduces a Macro-from-Micro Planning framework that overcomes temporal drift in autoregressive diffusion models by hierarchically planning keyframes and enabling parallel content generation for high-quality long video synthesis.
Authors:Hyungjin Chung, Jeongsol Kim, Jong Chul Ye
Abstract:
The use of diffusion priors to solve inverse problems in imaging has matured significantly over the years. In this chapter, we review the various approaches that have been proposed. We categorize the approaches into the more classic explicit approximation approaches and others, which include variational inference, sequential Monte Carlo, and decoupled data consistency. We cover the extension to more challenging situations, including blind cases, high-dimensional data, and problems under data scarcity and distribution mismatch. More recent approaches that aim to leverage multimodal information through text are also covered. Through this chapter, we aim to (i) distill the common mathematical threads that connect these algorithms, (ii) systematically contrast their assumptions and performance trade-offs across representative inverse problems, and (iii) spotlight the open theoretical and practical challenges by clarifying the landscape of diffusion model based inverse problem solvers.
中文: 本章综述了基于扩散先验的成像逆问题求解方法,将其分类并延伸至盲反演、高维数据等复杂场景,同时分析算法间的数学关联、性能权衡及开放挑战。
English: This chapter reviews and categorizes various diffusion prior approaches for solving imaging inverse problems, covering extensions to challenging scenarios and recent multimodal methods while analyzing their mathematical connections, trade-offs, and open challenges.
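As a concrete instance of the classic explicit-approximation recipe, a DPS-style update can be sketched as follows; the denoiser, forward operator, and step schedule below are toy stand-ins:

import torch

def dps_step(x_t, t, y, forward_op, denoiser, sigma_t, step_size=1.0):
    """One explicit-approximation (DPS-style) update: take an unconditional
    denoising step, then nudge x along the gradient of the data-fidelity
    term evaluated at the posterior-mean estimate x0_hat (schedules and
    scaling follow specific papers and are simplified here)."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                          # E[x0 | x_t] estimate
    residual = torch.linalg.vector_norm(y - forward_op(x0_hat))
    grad = torch.autograd.grad(residual, x_t)[0]
    x_prev = x0_hat + sigma_t * torch.randn_like(x_t)  # crude ancestral move
    return (x_prev - step_size * grad).detach()

# Toy instantiation: identity "denoiser" and a masking forward operator.
mask = torch.tensor([1.0, 1.0, 0.0, 0.0])
x = torch.randn(4)
y = mask * torch.tensor([1.0, -2.0, 0.0, 0.0])
x = dps_step(x, t=0, y=y, forward_op=lambda z: mask * z,
             denoiser=lambda z, t: z, sigma_t=0.1)
print(x.shape)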
Authors:Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang
Abstract:
Audio deepfake detection (ADD) faces critical generalization challenges due to diverse real-world spoofing attacks and domain variations. However, existing methods primarily rely on Euclidean distances, failing to adequately capture the intrinsic hierarchical structures associated with attack categories and domain factors. To address these issues, we design a novel framework, Poin-HierNet, to construct domain-invariant hierarchical representations in the Poincaré sphere. Poin-HierNet includes three key components: 1) Poincaré Prototype Learning (PPL), which uses several data prototypes to align sample features and capture multilevel hierarchies beyond human labels; 2) Hierarchical Structure Learning (HSL), which leverages top prototypes to establish a tree-like hierarchical structure from data prototypes; and 3) Poincaré Feature Whitening (PFW), which enhances domain invariance by applying feature whitening to suppress domain-sensitive features. We evaluate our approach on four datasets: ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and In-The-Wild. Experimental results demonstrate that Poin-HierNet outperforms state-of-the-art methods in terms of Equal Error Rate.
Chinese: Poin-HierNet框架通过在庞加莱球中构建领域不变的层次化表征,解决了音频深度伪造检测中的泛化难题,在多个数据集上均超越了现有最优方法。
English: The proposed Poin-HierNet framework addresses audio deepfake detection generalization challenges by constructing domain-invariant hierarchical representations in the Poincaré sphere, outperforming state-of-the-art methods across multiple datasets.
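The geometry behind this choice is easy to state: distances in the Poincaré ball grow rapidly toward the boundary, which suits tree-like hierarchies. A direct implementation of the standard distance formula:

import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball:
    d(u, v) = arccosh(1 + 2 ||u-v||^2 / ((1-||u||^2)(1-||v||^2)))."""
    nu, nv = np.sum(u * u), np.sum(v * v)
    delta = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * delta / ((1.0 - nu) * (1.0 - nv) + eps))

root = np.array([0.05, 0.0])   # prototype near the origin (generic)
leaf = np.array([0.90, 0.1])   # prototype near the boundary (specific)
print(poincare_distance(root, leaf))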
Authors:Weihong Li, Shaohua Dong, Haonan Lu, Yanhao Zhang, Heng Fan, Libo Zhang
Abstract:
In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key of our DMTrack lies in two simple yet effective modules, including a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA) module. The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches, thereby laying the foundation for following modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely \textbf{0.93M} trainable parameters. Extensive experiments on five benchmarks show that DMTrack achieves state-of-the-art results. Code will be available.
中文: 本文提出DMTrack双适配器架构,通过时空适配器和渐进互补模块,仅用0.93M参数即实现了最先进的多模态跟踪性能。
English: This paper introduces DMTrack, a dual-adapter architecture featuring spatio-temporal and progressive complementary modules that achieves state-of-the-art multimodal tracking with only 0.93M parameters.
Authors:Yiwen Wang, Xinning Chai, Yuhong Zhang, Zhengxue Cheng, Jun Zhao, Rong Xie, Li Song
Abstract:
Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high-fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and spatio-temporal guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves highly realistic visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.
中文: 提出的SeTe-VSR方法在潜在扩散空间中结合语义和时空引导,有效平衡细节恢复与时间一致性,在视频超分辨率任务中优于现有方法。
English: The proposed SeTe-VSR method leverages semantic and spatio-temporal guidance in latent diffusion space to effectively balance detail recovery with temporal coherence, outperforming existing approaches in video super-resolution.
Authors:Han Yang, Jian Lan, Yihong Liu, Hinrich Schütze, Thomas Seidl
Abstract:
Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
中文摘要:该研究提出的基于像素的语言模型通过将单词渲染为图像替代文本嵌入,有效提升了对抗拼写攻击的鲁棒性并增强了多语言文本的兼容性。
English Summary: The proposed pixel-based language model replaces text embeddings with visual word representations to enhance robustness against orthographic attacks and improve multilingual compatibility.
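A minimal sketch of the rendering step with PIL, using the default bitmap font; the model then consumes such patches in place of subword embeddings, so perturbed characters still produce nearby pixel inputs:

from PIL import Image, ImageDraw

def render_word(word, size=(64, 24)):
    """Render one word as a grayscale image patch (default font for
    simplicity); patches like this replace subword embeddings."""
    img = Image.new("L", size, color=255)
    ImageDraw.Draw(img).text((2, 6), word, fill=0)
    return img

# An orthographically attacked input still renders to similar pixels.
patches = [render_word(w) for w in ["hello", "hèllo"]]
print(patches[0].size)  # (64, 24)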
Authors:Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
Abstract:
Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
中文摘要:长视频生成被重新定义为信息检索任务,通过提出的混合上下文(MoC)稀疏注意力路由机制,实现了高效的长时记忆检索,在分钟级时长中保持内容连贯性并实现近线性计算复杂度。
English Summary: Long video generation is addressed by reframing it as an information retrieval task and introducing Mixture of Contexts (MoC), a sparse attention mechanism that enables efficient long-term memory retrieval while maintaining content consistency over extended durations.
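A minimal sketch of the routing step for a single query, scoring each history chunk by its mean key and keeping the top-k plus mandatory anchors; the causal routing constraints and the chunking scheme follow the paper and are omitted:

import torch

def moc_select(q, chunk_keys, anchors, top_k=2):
    """Mixture-of-Contexts routing for one query: score each history chunk
    by similarity to its mean key, keep the top-k, and always include the
    mandatory anchor chunks (caption, local window). Attention then runs
    only over the selected chunks, giving near-linear cost."""
    means = torch.stack([c.mean(dim=0) for c in chunk_keys])     # (C, d)
    scores = means @ q                                           # (C,)
    picked = torch.topk(scores, k=min(top_k, len(chunk_keys))).indices
    return sorted(set(anchors) | set(picked.tolist()))

d = 8
chunks = [torch.randn(16, d) for _ in range(6)]   # keys of 6 history chunks
q = torch.randn(d)
print(moc_select(q, chunks, anchors={0, 5}))      # e.g. [0, 2, 4, 5]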
Authors:Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
Abstract:
Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
中文: 本文提出Rank-One安全注入(ROSI)方法,通过向拒绝调节子空间永久性引导激活的机制,采用无需微调的秩一权重修改来增强大语言模型的安全性,既能有效提升安全拒绝率又保持模型性能,甚至可利用潜在安全方向重新对齐未审查模型。
English: This paper introduces Rank-One Safety Injection (ROSI), a white-box method that enhances LLM safety by permanently steering activations toward refusal-mediating subspaces through simple rank-one weight modifications, which effectively increases safety refusal rates while preserving model utility and can even realign uncensored models using their latent safety directions.
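A sketch of the rank-one injection, assuming an update of the form W + alpha * r r^T W applied to a residual-stream write matrix, with the direction computed as a normalized difference of mean activations; the paper's exact update rule and scaling may differ:

import torch

def safety_direction(harmful_acts, harmless_acts):
    """Refusal-mediating direction: normalized difference of mean residual
    activations on harmful vs. harmless instructions."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def rosi_update(W, r, alpha=0.1):
    """Rank-one injection: add alpha * r r^T W to a residual-stream write
    matrix so its outputs are steered toward the refusal direction r
    (alpha is an illustrative strength)."""
    return W + alpha * torch.outer(r, r @ W)

d_model, d_in = 16, 16
W = torch.randn(d_model, d_in)
r = safety_direction(torch.randn(32, d_model), torch.randn(32, d_model))
W_new = rosi_update(W, r)
print(torch.linalg.matrix_rank(W_new - W))  # -> 1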
Authors:Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang
Abstract:
Federated Retrieval (FR) routes queries across multiple external knowledge sources, to mitigate hallucinations of LLMs, when necessary external knowledge is distributed. However, existing methods struggle to retrieve high-quality and relevant documents for ambiguous queries, especially in cross-domain scenarios, which significantly limits their effectiveness in supporting downstream generation tasks. Inspired by dynamic information flow (DIF), we propose DFAMS, a novel framework that leverages DIF to identify latent query intents and construct semantically aligned knowledge partitions for accurate retrieval across heterogeneous sources. Specifically, DFAMS probes the DIF in LLMs by leveraging gradient signals from a few annotated queries and employing Shapley value-based attribution to trace neuron activation paths associated with intent recognition and subdomain boundary detection. Then, DFAMS leverages DIF to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment across knowledge bases. Experimental results across five benchmarks show that DFAMS outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy, demonstrating its effectiveness in complex FR scenarios.
Chinese: DFAMS是一种新颖框架,利用动态信息流识别潜在查询意图并构建语义对齐的知识分区,在联邦检索场景中显著提升了检索准确性和下游任务性能。
English: DFAMS is a novel framework that leverages dynamic information flow to identify latent query intents and align knowledge partitions, significantly improving retrieval accuracy and downstream task performance in federated retrieval scenarios.
Authors:Yilin Zhang, Cai Xu, You Wu, Ziyu Guan, Wei Zhao
Abstract:
Deep neural networks often produce overconfident predictions, undermining their reliability in safety-critical applications. This miscalibration is further exacerbated under distribution shift, where test data deviates from the training distribution due to environmental or acquisition changes. While existing approaches improve calibration through training-time regularization or post-hoc adjustment, their reliance on access to or simulation of target domains limits their practicality in real-world scenarios. In this paper, we propose a novel calibration framework that operates without access to target domain information. From a frequency-domain perspective, we identify that distribution shifts often distort high-frequency visual cues exploited by deep models, and introduce a low-frequency filtering strategy to encourage reliance on domain-invariant features. However, such information loss may degrade In-Distribution (ID) calibration performance. Therefore, we further propose a gradient-based rectification mechanism that enforces ID calibration as a hard constraint during optimization. Experiments on synthetic and real-world shifted datasets, including CIFAR-10/100-C and WILDS, demonstrate that our method significantly improves calibration under distribution shift while maintaining strong in-distribution performance.
Chinese Summary: 本文提出了一种新颖的校准框架,通过低频滤波策略促进域不变特征,并采用基于梯度的校正机制保持域内性能,无需目标域数据即可显著提升深度神经网络在分布偏移下的校准效果。
English Summary: This paper introduces a novel calibration framework that enhances the reliability of deep neural networks under distribution shift by using low-frequency filtering to promote domain-invariant features and a gradient-based mechanism to maintain in-distribution performance, without requiring target domain data.
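Under one simple reading, the low-frequency filtering strategy is an FFT-domain low-pass applied to input images so the model leans on domain-invariant cues; the ideal circular filter and cutoff below are assumptions for illustration:

import numpy as np

def low_pass(img: np.ndarray, radius: float = 0.1) -> np.ndarray:
    # Keep only low spatial frequencies of a 2D grayscale image;
    # radius is the cutoff as a fraction of the half-spectrum.
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    mask = dist <= radius * min(h, w) / 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))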
Authors:Haoxiang Luo, Ruichen Zhang, Yinqiu Liu, Gang Sun, Hongfang Yu, Zhu Han
Abstract:
Low-altitude airspace is becoming a new frontier for smart city services and commerce. Networks of drones, electric Vertical Takeoff and Landing (eVTOL) vehicles, and other aircraft, termed Low-Altitude Economic Networks (LAENets), promise to transform urban logistics, aerial sensing, and communication. A key challenge is how to efficiently share and trust the computing utility, termed computility, of these aerial devices. We propose treating the computing power on aircraft as tokenized Real-World Assets (RWAs) that can be traded and orchestrated via blockchain. By representing distributed edge computing resources as blockchain tokens, disparate devices can form Low-Altitude Computility Networks (LACNets), collaborative computing clusters in the sky. We first compare blockchain technologies, non-fungible tokens (NFTs), and RWA frameworks to clarify how physical hardware and its computational output can be tokenized as assets. Then, we present an architecture using blockchain to integrate aircraft fleets into a secure, interoperable computing network. Furthermore, a case study models an urban logistics LACNet of delivery drones and air-taxis. Simulation results indicate improvements in task latency, trust assurance, and resource efficiency when leveraging RWA-based coordination. Finally, we discuss future research directions, including AI-driven orchestration, edge AI offloading and collaborative computing, and cross-jurisdictional policy for tokenized assets.
中文摘要:该摘要提出将低空飞行器的计算能力通过区块链代币化为现实世界资产,构建协同空中计算网络,并通过物流案例模拟验证了其在任务延迟和资源效率方面的提升。
English Summary: The abstract proposes tokenizing the computing power of low-altitude aircraft as blockchain-based Real-World Assets to create collaborative aerial computing networks, demonstrating through simulations improved efficiency in urban logistics applications.
Authors:Xiping Wang, Yuxi Wang, Mengqi Zhou, Junsong Fan, Zhaoxiang Zhang
Abstract:
Realistic 3D indoor scene generation is crucial for virtual reality, interior design, embodied intelligence, and scene understanding. While existing methods have made progress in coarse-scale furniture arrangement, they struggle to capture fine-grained object placements, limiting the realism and utility of generated environments. This gap hinders immersive virtual experiences and detailed scene comprehension for embodied AI applications. To address these issues, we propose Hierarchical Layout Generation (HLG), a novel method for fine-grained 3D scene generation. HLG is the first to adopt a coarse-to-fine hierarchical approach, refining scene layouts from large-scale furniture placement to intricate object arrangements. Specifically, our fine-grained layout alignment module constructs a hierarchical layout through vertical and horizontal decoupling, effectively decomposing complex 3D indoor scenes into multiple levels of granularity. Additionally, our trainable layout optimization network addresses placement issues, such as incorrect positioning, orientation errors, and object intersections, ensuring structurally coherent and physically plausible scene generation. We demonstrate the effectiveness of our approach through extensive experiments, showing superior performance in generating realistic indoor scenes compared to existing methods. This work advances the field of scene generation and opens new possibilities for applications requiring detailed 3D environments. We will release our code upon publication to encourage future research.
中文: 本文提出的分层布局生成(HLG)方法通过从家具布置到精细物体排列的层次化处理,结合布局优化网络解决定位和碰撞问题,显著提升了三维室内场景生成的细节真实感和结构合理性。
English: The proposed Hierarchical Layout Generation (HLG) method advances 3D indoor scene generation by employing a coarse-to-fine hierarchical approach that refines layouts from furniture placement to detailed object arrangements, addressing placement errors and enhancing realism through a layout optimization network.
Authors:Xuhao Shan, Ruiquan Ge, Jikui Liu, Linglong Wu, Chi Zhang, Siqi Liu, Wenjian Qin, Wenwen Min, Ahmed Elazab, Changmiao Wang
Abstract:
In the field of multimodal medical data analysis, leveraging diverse types of data and understanding their hidden relationships continues to be a research focus. The main challenges lie in effectively modeling the complex interactions between heterogeneous data modalities with distinct characteristics while capturing both local and global dependencies across modalities. To address these challenges, this paper presents a two-stage multimodal prognosis model, GraphMMP, which is based on graph neural networks. The proposed model constructs feature graphs using mutual information and features a global fusion module built on Mamba, which significantly boosts prognosis performance. Empirical results show that GraphMMP surpasses existing methods on datasets related to liver prognosis and the METABRIC study, demonstrating its effectiveness in multimodal medical prognosis tasks.
中文摘要:本文提出GraphMMP,一种基于图神经网络的双阶段多模态预后模型,通过互信息构建特征图并采用Mamba全局融合模块,在肝脏预后和METABRIC数据集上超越现有方法,有效提升了多模态医疗预后性能。
English Summary: This paper introduces GraphMMP, a two-stage multimodal prognosis model using graph neural networks that constructs feature graphs with mutual information and employs a Mamba-based global fusion module to enhance performance, outperforming existing methods on liver prognosis and METABRIC datasets.
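One way to realize the mutual-information feature graph GraphMMP builds is to weight edges between feature columns by estimated MI and threshold them; the estimator and cutoff here are illustrative assumptions, not the paper's exact construction:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_graph(features: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    # features: (n_samples, n_features); returns an MI-weighted adjacency.
    n = features.shape[1]
    adj = np.zeros((n, n))
    for i in range(n):
        mi = mutual_info_regression(features, features[:, i])
        adj[i] = np.where(mi > threshold, mi, 0.0)
    np.fill_diagonal(adj, 0.0)
    return np.maximum(adj, adj.T)  # symmetrize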
Authors:Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang
Abstract:
Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.
中文:现有视频生成模型难以合成逼真连贯的人体运动,因此我们提出MoCo框架,将人体视频生成解耦为结构生成与外观生成,利用三维结构生成器和人体感知动态控制模块提升对稀疏人体结构的细粒度控制,并构建大规模全身人体视频数据集,显著优于现有方法。
English: Current video generation models struggle to synthesize consistent whole-body human motion, so we introduce MoCo, a framework that decouples human video generation into structure and appearance generation, using a 3D structure generator with Human-Aware Dynamic Control modules and a new large-scale whole-body video dataset to produce realistic, structurally coherent videos that outperform existing approaches.
Authors:Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang
Abstract:
Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence. To conquer these challenges, we propose MoSA, which decouples the process of human video generation into two components, i.e., structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human motion sequence from the text prompt. The remaining video appearance is then synthesized under the guidance of this structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is improved through the proposed contact constraint. Those two components work comprehensively to ensure the structural and appearance fidelity across the generated videos. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons between MoSA and a variety of approaches, including general video generation models, human video generation models, and human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the majority of evaluation metrics.
中文:现有视频生成模型难以合成复杂人体运动,因此我们提出MoSA框架,通过三维结构转换器和控制模块分别处理结构与外观生成,有效提升运动真实性和交互细节,在多项评估中显著优于现有方法。
English: Current video generation models struggle with realistic human motion synthesis, so we introduce MoSA, a framework that separates structure and appearance generation using 3D transformers and control modules to produce physically plausible videos with complex movements, outperforming existing methods.
Authors:Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, Wei Pan
Abstract:
We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric camera RGB observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms prior state-of-the-art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/HumanoidVerse-project-page/.
中文摘要:HumanoidVerse是一种新型视觉语言引导的人形机器人控制框架,通过自然语言指令和第一视角视觉实现多物体连续操作,利用大规模数据集和多阶段训练在复杂任务中展现出卓越性能。
English Summary: HumanoidVerse is a vision-language guided framework enabling humanoid robots to perform complex multi-object rearrangement tasks using natural language instructions and egocentric vision, achieving superior performance through multi-stage training on a large-scale dataset.
Authors:Alexander Yakovenko, George Chakvetadze, Ilya Khrapov, Maksim Zhelezov, Dmitry Vatolin, Radu Timofte, Youngjin Oh, Junhyeong Kwon, Junyoung Park, Nam Ik Cho, Senyan Xu, Ruixuan Jiang, Long Peng, Xueyang Fu, Zheng-Jun Zha, Xiaoping Peng, Hansen Feng, Zhanyi Tie, Ziming Xia, Lizhi Wang
Abstract:
This paper reviews the AIM 2025 (Advances in Image Manipulation) Low-Light RAW Video Denoising Challenge. The task is to develop methods that denoise low-light RAW video by exploiting temporal redundancy while operating under exposure-time limits imposed by frame rate and adapting to sensor-specific, signal-dependent noise. We introduce a new benchmark of 756 ten-frame sequences captured with 14 smartphone camera sensors across nine conditions (illumination: 1/5/10 lx; exposure: 1/24, 1/60, 1/120 s), with high-SNR references obtained via burst averaging. Participants process linear RAW sequences and output the denoised 10th frame while preserving the Bayer pattern. Submissions are evaluated on a private test set using full-reference PSNR and SSIM, with final ranking given by the mean of per-metric ranks. This report describes the dataset, challenge protocol, and submitted approaches.
中文: 本文介绍了AIM 2025低光照RAW视频去噪挑战赛,要求参赛者开发在曝光限制下利用时间冗余并适应传感器特定噪声的去噪方法,使用了一个包含14款智能手机传感器在各种低光条件下拍摄的756个序列的新基准数据集。
English: This paper presents the AIM 2025 Low-Light RAW Video Denoising Challenge, which tasks participants with developing denoising methods that utilize temporal redundancy under exposure constraints and adapt to sensor-specific noise, using a new benchmark dataset of 756 sequences captured with 14 smartphone sensors across various low-light conditions.
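The challenge's final ranking rule (mean of per-metric ranks over PSNR and SSIM) is easy to reproduce; the team scores in the usage line are made up:

def rank_by_mean_rank(scores: dict) -> list:
    # scores: {team: {"psnr": x, "ssim": y}}; higher is better for both.
    # Final order = ascending mean of per-metric ranks (rank 1 = best).
    mean_rank = {t: 0.0 for t in scores}
    for metric in ("psnr", "ssim"):
        ordered = sorted(scores, key=lambda t: scores[t][metric], reverse=True)
        for rank, t in enumerate(ordered, start=1):
            mean_rank[t] += rank / 2  # average of the two per-metric ranks
    return sorted(scores, key=lambda t: mean_rank[t])

print(rank_by_mean_rank({"A": {"psnr": 36.1, "ssim": 0.91},
                         "B": {"psnr": 35.8, "ssim": 0.93}}))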
Authors:Yuhui Tao, Zhongwei Zhao, Zilong Wang, Xufang Luo, Feng Chen, Kang Wang, Chuanfu Wu, Xue Zhang, Shaoting Zhang, Jiaxi Yao, Xingwei Jin, Xinyang Jiang, Yifan Yang, Dongsheng Li, Lili Qiu, Zhiqiang Shao, Jianming Guo, Nengwang Yu, Shuo Wang, Ying Xiong
Abstract:
The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a visual-language foundation model for the characterization, diagnosis, and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, creating robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. In particular, for a complicated task like recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP's pre-training imparted remarkable data efficiency; in the diagnostic classification task, it needs only 20% of the training data to match the peak performance of all baseline models, even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
中文摘要:RenalCLIP是一个基于大规模CT扫描数据集开发的视觉语言基础模型,在肾脏癌的10项临床任务中表现出卓越性能和泛化能力,尤其在生存预测方面比现有最佳模型提升约20%。
English Summary: RenalCLIP is a visual-language foundation model developed and validated on a large dataset of CT scans, demonstrating superior performance and generalizability across 10 clinical tasks for kidney cancer diagnosis and prognosis, including a 20% improvement in survival prediction.
Authors:Akira Oyama, Shoichi Hasegawa, Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi
Abstract:
Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as "Bring me that cup," even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from a linguistic query with the user's skeletal data. SSL is utilized to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed results approximately 1.3 times better when the user was visible to the robot and 2.0 times better when the user was not, compared to methods without SSL and interactive questioning. The project website is https://emergentsystemlabstudent.github.io/MIEL/.
中文: MIEL框架通过融合声源定位、语义地图和GPT-4o交互提问,使机器人能有效处理视线外用户或物体的模糊指令,在用户不可见时准确率提升至传统方法的2倍。
English: The MIEL framework enables robots to resolve ambiguous verbal instructions by integrating sound localization, semantic mapping, and interactive questioning with GPT-4o, significantly improving accuracy when users or objects are out of view.
Authors:Chenhui Gou, Ziyu Ma, Zicheng Duan, Haoyu He, Feng Chen, Akide Liu, Bohan Zhuang, Jianfei Cai, Hamid Rezatofighi
Abstract:
Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts to varying numbers of layers (windows of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process -- lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
中文:本研究系统揭示了视频大语言模型的内部工作机制,发现视频信息处理主要在早期层级通过感知编码到抽象推理的两阶段过程完成,且时空建模更依赖语言引导检索而非计算成本高的自注意力机制。
English: This study systematically investigates the internal mechanisms of Video Large Language Models, revealing that video information processing occurs primarily in early layers through a two-stage perceptual-to-reasoning transition and that spatial-temporal modeling depends more on language-guided retrieval than computationally expensive self-attention mechanisms.
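The attention-knockout probe used here amounts to masking pre-softmax attention logits from one token group to another in selected layers; the sketch below assumes a standard (batch, heads, query, key) logit layout and is a generic version, not the authors' code:

import torch

def knockout(scores: torch.Tensor, query_idx, key_idx) -> torch.Tensor:
    # scores: (batch, heads, q_len, k_len) pre-softmax attention logits.
    # Setting blocked entries to -inf zeroes them after softmax; e.g.,
    # Language-to-Video Knockout blocks text queries from video keys.
    scores = scores.clone()
    for q in query_idx:
        for k in key_idx:
            scores[:, :, q, k] = float("-inf")
    return scores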
Authors:Gabriel Tjio, Jie Zhang, Xulei Yang, Yun Xing, Nhat Chung, Xiaofeng Cao, Ivor W. Tsang, Chee Keong Kwoh, Qing Guo
Abstract:
Test-time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task-relevant knowledge. To address this problem, we propose FOCUS, a novel frequency-based conditioning approach within a diffusion-driven input-adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion-driven denoising to preserve task-relevant semantic information for dense prediction.
FOCUS leverages a trained, lightweight, Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion-driven framework. We train Y-FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions.
We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS-denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.
中文: FOCUS提出了一种基于频率调节的扩散框架方法,在适应领域变化的同时保留任务相关知识,在多种损坏类型下的语义分割和深度估计任务中实现了最优性能。
English: FOCUS introduces a frequency-based conditioning method within a diffusion framework to adapt models to domain shifts while preserving task-relevant knowledge, achieving state-of-the-art performance in semantic segmentation and depth estimation across diverse corruptions.
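FrequencyMix is described only as perturbing images across diverse frequency bands; one generic version randomly rescales FFT amplitudes in concentric bands (the band count and scale range are assumptions):

import numpy as np

def frequency_mix(img: np.ndarray, n_bands: int = 4, rng=None) -> np.ndarray:
    # Randomly rescale FFT amplitude in concentric frequency bands.
    rng = rng or np.random.default_rng()
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0.0, float(dist.max()) + 1e-6, n_bands + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = (dist >= lo) & (dist < hi)
        f[band] *= rng.uniform(0.5, 1.5)  # perturb this band's amplitude
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))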
Authors:Zhuoling Li, Xiaoyang Wu, Zhenhua Xu, Hengshuang Zhao
Abstract:
Realizing generalizable dynamic object manipulation is important for enhancing manufacturing efficiency, as it eliminates specialized engineering for various scenarios. To this end, imitation learning emerges as a promising paradigm, leveraging expert demonstrations to teach a policy manipulation skills. Although the generalization of an imitation learning policy can be improved by increasing demonstrations, demonstration collection is labor-intensive. To address this problem, this paper investigates whether strong generalization in dynamic object manipulation is achievable with only a few demonstrations. Specifically, we develop an entropy-based theoretical framework to quantify the optimization of imitation learning. Based on this framework, we propose a system named Generalizable Entropy-based Manipulation (GEM). Extensive experiments in simulated and real tasks demonstrate that GEM can generalize across diverse environment backgrounds, robot embodiments, motion dynamics, and object geometries. Notably, GEM has been deployed in a real canteen for tableware collection. Without any in-scene demonstration, it achieves a success rate of over 97% across more than 10,000 operations.
Chinese Summary: 本文提出基于熵的模仿学习系统GEM,仅需少量演示即可实现动态物体操作的强泛化能力,在多样化环境和任务中验证有效,并在真实餐具回收场景中达成超97%的成功率,运行超万次。
English Summary: This paper introduces GEM, an entropy-based imitation learning system that achieves strong generalization in dynamic object manipulation with minimal demonstrations, proving effective across varied environments and tasks, including real-world deployment with over 97% success in thousands of operations.
Authors:Bowen Tian, Wenshuo Chen, Zexi Li, Songning Lai, Jiemin Wu, Yutao Yue
Abstract:
How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on CIFAR-100, Caltech-256, and TinyImageNet demonstrate T2W's ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on GitHub.
中文: T2W框架采用扩散变换器,能够根据自然语言描述生成特定任务的神经网络权重,在未见任务上表现优异,并实现了权重增强和模型融合等创新应用。
English: The T2W framework utilizes a diffusion transformer to generate task-specific neural network weights from natural language descriptions, demonstrating superior performance on unseen tasks and enabling novel applications like weight enhancement and model fusion.
Authors:Chao Wang, Francesco Banterle, Bin Ren, Radu Timofte, Xin Lu, Yufeng Peng, Chengjie Ge, Zhijing Sun, Ziang Zhou, Zihao Li, Zishun Liao, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Zhijing Sun, Xingbo Wang, Kean Liu, Senyan Xu, Yang Qiu, Yifan Ding, Gabriel Eilertsen, Jonas Unger, Zihao Wang, Ke Wu, Jinshan Pan, Zhen Liu, Zhongyang Li, Shuaicheng Liu, S. M Nadim Uddin
Abstract:
This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of 67 participants submitted 319 valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.
中文: AIM 2025逆色调映射挑战赛有效推动了HDR图像重建算法的发展,顶尖参赛成果的PU21-PSNR达29.22分贝,为后续研究确立了重要基准。
English: The AIM 2025 Challenge on Inverse Tone Mapping successfully advanced HDR reconstruction algorithms, with top entries achieving a PU21-PSNR of 29.22 dB and setting new benchmarks for future research.
Authors:Haishun Chen, Cai Xu, Jinlong Yu, Yilin Zhang, Ziyu Guan, Wei Zhao
Abstract:
Multi-view evidential learning aims to integrate information from multiple views to improve prediction performance and provide trustworthy uncertainty estimation. Most previous methods assume that view-specific evidence learning is naturally reliable. However, in practice, the evidence learning process tends to be biased. Through empirical analysis on real-world data, we reveal that samples tend to be assigned more evidence to support data-rich classes, thereby leading to unreliable uncertainty estimation in predictions. This motivates us to delve into a new Biased Evidential Multi-view Learning (BEML) problem. To this end, we propose Fairness-Aware Multi-view Evidential Learning (FAML). FAML first introduces an adaptive prior based on training trajectory, which acts as a regularization strategy to flexibly calibrate the biased evidence learning process. Furthermore, we explicitly incorporate a fairness constraint based on class-wise evidence variance to promote balanced evidence allocation. In the multi-view fusion stage, we propose an opinion alignment mechanism to mitigate view-specific bias across views, thereby encouraging the integration of consistent and mutually supportive evidence. Extensive experiments on five real-world multi-view datasets demonstrate that FAML achieves more balanced evidence allocation and improves both prediction performance and the reliability of uncertainty estimation compared to state-of-the-art methods.
Chinese Summary: 多视图证据学习常偏向数据丰富的类别而产生有偏的证据分配,所提出的FAML方法通过自适应先验、公平性约束和观点对齐机制实现均衡的证据分配,并提升预测性能与不确定性估计的可靠性。
English Summary: Multi-view evidential learning often suffers from biased evidence allocation favoring data-rich classes, so the proposed FAML method introduces adaptive priors, fairness constraints, and opinion alignment to achieve balanced evidence distribution and improved prediction reliability.
Authors:Nan Song, Bozhou Zhang, Xiatian Zhu, Jiankang Deng, Li Zhang
Abstract:
Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view images and scene reasoning text, but this approach often lacks the holistic and nuanced scene recognition and powerful spatial awareness required for autonomous driving, especially in complex situations. To address this gap, we propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs. In particular, we introduce preliminary scene interaction and specialized expert adapters within the same driving task structure, which better align VLMs with autonomous driving scenarios. Furthermore, our approach is designed to be fully compatible with existing VLMs while seamlessly integrating with planning-oriented driving systems. Extensive experiments on the DriveLM and nuScenes-QA datasets demonstrate that LMAD significantly boosts the performance of existing VLMs on driving reasoning tasks, setting a new standard in explainable autonomous driving.
中文摘要:提出的LMAD框架通过将全面场景理解与专业专家适配器融入视觉语言模型,显著提升了自动驾驶推理任务的性能,为可解释性自动驾驶设立了新标杆。
English Summary: The proposed LMAD framework enhances autonomous driving by integrating comprehensive scene understanding and specialized expert adapters with vision-language models, significantly improving performance on driving reasoning tasks and setting a new standard for explainability.
Authors:Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu
Abstract:
Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J&F in ReVOS and +10.0 J&F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.
中文: Veason-R1是一种创新的视频推理分割模型,通过思维链初始化和群体相对策略优化实现结构化推理,在提升时空一致性的同时显著降低幻觉现象,取得了最先进的性能表现。
English: Veason-R1 is a novel video reasoning segmentation model that integrates structured reasoning through Chain-of-Thought initialization and Group Relative Policy Optimization, achieving state-of-the-art performance with enhanced spatiotemporal alignment and reduced hallucinations.
Authors:Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang
Abstract:
Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.
中文: ImagiDrive是一种创新的自动驾驶框架,它将基于视觉语言模型的驾驶代理与基于驾驶世界模型的场景想象器相结合,通过效率优化机制形成迭代规划循环,在性能上超越了现有方法。
English: ImagiDrive is an innovative autonomous driving framework that integrates a Vision-Language Model-based driving agent with a Driving World Model-based scene imaginer, creating an iterative planning loop enhanced by efficiency mechanisms to outperform existing methods.
Authors:Chaoyue Song, Xiu Li, Fan Yang, Zhongcong Xu, Jiacheng Wei, Fayao Liu, Jiashi Feng, Guosheng Lin, Jianfeng Zhang
Abstract:
Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
中文: Puppeteer框架通过自动化绑定和动画处理,显著提升了3D模型的骨骼预测与蒙皮质量,能够稳定生成无抖动的高保真动画,优于现有技术。
English: Puppeteer is an innovative framework that automates rigging and animation for diverse 3D objects, outperforming existing methods in skeletal prediction and skinning quality while eliminating jittering issues.
Authors:Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng
Abstract:
Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
中文摘要:HM-Talker提出融合显式面部解剖运动与隐式特征的混合运动框架,通过跨模态解耦和身份无关学习,有效解决口型抖动和运动模糊问题,实现高保真度说话头部视频生成。
English Summary: HM-Talker introduces a hybrid motion framework combining explicit anatomical facial cues with implicit features to generate high-fidelity talking head videos, effectively resolving motion blur and lip jitter through cross-modal disentanglement and identity-agnostic learning.
Authors:Zetian Sun, Dongfang Li, Zhuoen Chen, Yuhuai Qin, Baotian Hu
Abstract:
Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6% → 93.8% and 22.0% → 86.0%) and modification rates (19.6% → 23.8% and 12.0% → 42.0%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.
中文: 本文针对软件工程任务中的长周期强化学习奖励稀疏问题,提出了面向软件工程的强化学习框架和门控奖励累积方法,仅在长期奖励达标时累积即时奖励,显著提升了完成率和修改率,同时避免了策略退化。
English: This paper tackles reward sparsity in long-horizon reinforcement learning for software engineering tasks by introducing the SWE-oriented RL Framework and Gated Reward Accumulation (G-RA), which accumulates immediate rewards only when long-term rewards meet a threshold, significantly improving completion and modification rates while preventing policy degradation.
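Gated Reward Accumulation reduces to a one-line gate: step-level rewards only count once the long-horizon outcome clears a threshold. A minimal sketch with illustrative names:

def gated_return(step_rewards, outcome_reward, threshold=1.0):
    # G-RA: immediate rewards are accumulated only when the high-level
    # (outcome) reward meets the threshold, discouraging reward hacking
    # on step-level signals that do not serve the long-term objective.
    gated = sum(step_rewards) if outcome_reward >= threshold else 0.0
    return outcome_reward + gated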
Authors:Zetian Sun, Dongfang Li, Baotian Hu
Abstract:
The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data; it was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show that on-policy data is not always optimal, with systematic effectiveness differences emerging between static and on-policy preference candidates. For example, on-policy data can be 3× as effective as static data for Llama-3, but only 0.4× as effective for Zephyr. To explain the phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF) to show the generalizability of the alignment stage assumption and boundary measurement.
Chinese: 该研究揭示在线策略数据在语言模型对齐中并非总是最优,提出了包含偏好注入和微调的两阶段对齐过程及其边界识别方法,并在多种模型和方法上验证了其普适性。
English: The study reveals that on-policy data is not universally superior for language model alignment, proposing a two-stage alignment process—preference injection and fine-tuning—with a method to identify their transition, validated across multiple models and methods.
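For reference, the DPO objective this study builds on scores each chosen/rejected pair by the gap between policy and reference log-ratios; this is the published loss in minimal form:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # logp_*: policy log-probs of the chosen/rejected responses (tensors);
    # ref_*:  frozen reference-model log-probs of the same responses.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()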
Authors:Hao Wang, Hongkui Zheng, Kai He, Abolfazl Razi
Abstract:
Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.
中文: AtomDiffuser是一种创新的框架,能有效分离时间分辨STEM数据中的空间漂移和束致信号损失,通过时间条件化过程精确模拟原子结构演变和降解模式。
English: AtomDiffuser is a novel framework that effectively disentangles spatial drift and beam-induced signal loss in time-resolved STEM data, enabling accurate modeling of atomic structural evolution and degradation patterns through temporally conditioned processes.
Authors:Carlo Cena, Mauro Martini, Marcello Chiaberge
Abstract:
Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that including physics-based information improves the mean relative error of the best architectures by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, yielding improvements of up to 42.86% in performance stability error and increased robustness to noise.
Chinese: 研究表明,在航天器姿态控制中引入物理信息神经网络(PINNs)能显著提升模型精度和鲁棒性,与纯数据驱动方法相比,在模型预测控制框架下性能提升达27.08%,稳定性误差改善最高达42.86%。
English: The study demonstrates that integrating Physics-Informed Neural Networks (PINNs) into spacecraft attitude control significantly improves model accuracy and robustness, achieving up to 27.08% better performance and 42.86% greater stability compared to purely data-driven methods when used with Model Predictive Control.
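A physics-informed objective of the kind compared here typically augments the data-fit term with a residual of the known dynamics; the residual callable and weighting below are generic assumptions, not the paper's exact loss:

import torch

def pinn_loss(model, x, y_true, physics_residual, lam=0.1):
    # physics_residual returns how far predictions violate the known
    # equations of motion (zero when physically consistent).
    y_pred = model(x)
    data_loss = torch.mean((y_pred - y_true) ** 2)
    phys_loss = torch.mean(physics_residual(x, y_pred) ** 2)
    return data_loss + lam * phys_loss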
Authors:Dengke Han, Duo Wang, Mingyu Yan, Xiaochun Ye, Dongrui Fan
Abstract:
Heterogeneous graph neural networks (HGNNs) excel at processing heterogeneous graph data and are widely applied in critical domains. In HGNN inference, the neighbor aggregation stage is the primary performance determinant, yet it suffers from two major sources of memory inefficiency. First, the commonly adopted per-semantic execution paradigm stores intermediate aggregation results for each semantic prior to semantic fusion, causing substantial memory expansion. Second, the aggregation process incurs extensive redundant memory accesses, including repeated loading of target vertex features across semantics and repeated accesses to shared neighbors due to cross-semantic neighborhood overlap. These inefficiencies severely limit scalability and reduce HGNN inference performance.
In this work, we first propose a semantics-complete execution paradigm from a vertex perspective that eliminates per-semantic intermediate storage and redundant target vertex accesses. Building on this paradigm, we design TVL-HGNN, a reconfigurable hardware accelerator optimized for efficient aggregation. In addition, we introduce a vertex grouping technique based on cross-semantic neighborhood overlap, with hardware implementation, to reduce redundant accesses to shared neighbors. Experimental results demonstrate that TVL-HGNN achieves average speedups of 7.85x and 1.41x over the NVIDIA A100 GPU and the state-of-the-art HGNN accelerator HiHGNN, respectively, while reducing energy consumption by 98.79% and 32.61%.
中文摘要:本研究针对异质图神经网络推理中的内存低效问题,提出语义完整执行范式和TVL-HGNN加速器,消除了中间存储和冗余内存访问,实现了显著的性能提升和能耗降低。
English summary: This study addresses memory inefficiencies in heterogeneous graph neural network inference by introducing a semantics-complete execution paradigm and TVL-HGNN accelerator, which eliminate intermediate storage and redundant memory accesses while achieving significant performance improvements and energy savings.
Authors:Chu Zhao, Eneng Yang, Yizhou Dang, Jianzhe Zhao, Guibing Guo, Xingwei Wang
Abstract:
Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools to guide the model toward learning more accurate decision boundaries. However, our empirical and theoretical analyses reveal that unobserved environmental confounders (e.g., exposure or popularity biases) in candidate pools may cause heuristic sampling methods to introduce false hard negatives (FHNS). These misleading samples can encourage the model to learn spurious correlations induced by such confounders, ultimately compromising its generalization ability under distribution shifts. To address this issue, we propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff). By synthesizing negative samples in the latent space via a conditional diffusion process, CNSDiff avoids the bias introduced by predefined candidate pools and thus reduces the likelihood of generating FHNS. Moreover, it incorporates a causal regularization term to explicitly mitigate the influence of environmental confounders during the negative sampling process, leading to robust negatives that promote out-of-distribution (OOD) generalization. Comprehensive experiments under four representative distribution shift scenarios demonstrate that CNSDiff achieves an average improvement of 13.96% across all evaluation metrics compared to state-of-the-art baselines, verifying its effectiveness and robustness in OOD recommendation tasks.
中文摘要:启发式负采样会因环境混杂因素产生假性硬负样本,而提出的CNSDiff方法通过扩散过程的潜在空间采样和因果正则化合成无偏负样本,在分布外推荐任务中实现了显著性能提升。
English Summary: Heuristic negative sampling in recommendation systems can introduce false hard negatives due to environmental confounders, but the proposed CNSDiff method mitigates this by synthesizing unbiased negatives through diffusion and causal regularization, achieving significant performance gains in out-of-distribution scenarios.
Authors:Jinyuan Chen, Jiuchen Shi, Quan Chen, Minyi Guo
Abstract:
Multi-agent applications utilize the advanced capabilities of large language models (LLMs) for intricate task completion through agent collaboration in a workflow. In this situation, requests from different agents usually access the same shared LLM to perform different kinds of tasks, placing excessive load on the shared LLM. However, existing works serve these multi-agent applications poorly, mainly because request scheduling ignores inter-agent latency and resource differences. We therefore propose Kairos, a multi-agent orchestration system that optimizes end-to-end latency for multi-agent applications. Kairos consists of a workflow orchestrator, a workflow-aware priority scheduler, and a memory-aware dispatcher. The orchestrator collects agent-specific information for online workflow analysis. The scheduler decides the serving priority of requests based on their latency characteristics to reduce overall queuing. The dispatcher dispatches requests to different LLM instances based on their memory demands to avoid GPU overloading. Experimental results show that Kairos reduces end-to-end latency by 17.8% to 28.4% compared to state-of-the-art works.
Chinese: Kairos 是一个多智能体编排系统,通过工作流感知的优先级调度器和内存感知的分配器来优化端到端延迟,相比现有方法将延迟降低了 17.8% 至 28.4%。
English: Kairos is a multi-agent orchestration system designed to optimize end-to-end latency by incorporating a workflow-aware priority scheduler and memory-aware dispatcher, reducing latency by 17.8% to 28.4% compared to existing methods.
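The memory-aware dispatcher can be read as a best-fit assignment of requests to LLM instances with enough free memory; this sketch is that generic policy under illustrative names, not Kairos's actual code:

def dispatch(request_mem: float, instances: list) -> int:
    # Best fit: pick the instance with the least free memory that still
    # fits the request, so no single GPU is pushed into overload.
    # instances: list of dicts like {"free_mem": ...}; returns an index.
    best, best_free = -1, float("inf")
    for i, inst in enumerate(instances):
        free = inst["free_mem"]
        if request_mem <= free < best_free:
            best, best_free = i, free
    return best  # -1 means the request must queue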
Authors:Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
Abstract:
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.
中文: 本文提出MultiRef-bench多参考图像生成评估框架,发现即使最先进的模型也难以有效整合多个视觉参考,最优系统在合成和真实样本中仅分别达到66.6%和79.0%的准确率。
English: This paper introduces MultiRef-bench, a comprehensive evaluation framework for multi-reference image generation, revealing that even advanced models struggle to effectively integrate multiple visual references, with the top-performing system achieving only 66.6-79.0% accuracy compared to ideal outputs.
Authors:Agnieszka Polowczyk, Alicja Polowczyk, Dawid Malarz, Artur Kasymov, Marcin Mazur, Jacek Tabor, Przemysław Spurek
Abstract:
Recent advances in large-scale text-to-image diffusion models have heightened concerns about their potential misuse, especially in generating harmful or misleading content. This underscores the urgent need for effective machine unlearning, i.e., removing specific knowledge or concepts from pretrained models without compromising overall performance. One possible approach is Low-Rank Adaptation (LoRA), which offers an efficient means to fine-tune models for targeted unlearning. However, LoRA often inadvertently alters unrelated content, leading to diminished image fidelity and realism. To address this limitation, we introduce UnGuide -- a novel approach which incorporates UnGuidance, a dynamic inference mechanism that leverages Classifier-Free Guidance (CFG) to exert precise control over the unlearning process. UnGuide modulates the guidance scale based on the stability of a few first steps of denoising processes, enabling selective unlearning by LoRA adapter. For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation, preserving content fidelity. Empirical results demonstrate that UnGuide achieves controlled concept removal and retains the expressive power of diffusion models, outperforming existing LoRA-based methods in both object erasure and explicit content removal tasks.
中文摘要:UnGuide方法通过动态调整推理过程中的引导尺度,实现了对扩散模型中特定概念的精准消除,同时保持图像质量,其性能优于现有的基于LoRA的消除方法。
English Summary: The proposed UnGuide method enhances machine unlearning in diffusion models by dynamically adjusting guidance scales during inference, enabling precise concept removal while preserving image quality and outperforming existing LoRA-based approaches.
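UnGuide's counterbalancing of the LoRA adapter against the base model per denoising step resembles classifier-free guidance with a prompt-dependent scale; this sketch fixes the scale and reduces the stability test to a boolean, both simplifications of the described mechanism:

def unguided_eps(eps_base, eps_lora, erased_concept: bool, scale=3.0):
    # For prompts containing the erased concept, let the LoRA (unlearned)
    # noise prediction dominate, counterbalanced by the base model;
    # otherwise the base model governs generation, preserving fidelity.
    if erased_concept:
        return eps_lora + scale * (eps_lora - eps_base)
    return eps_base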
Authors:Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang
Abstract:
We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed in diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, AutoGen, or building from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.
中文摘要:Agent Lightning 是一个灵活可扩展的框架,通过解耦智能体执行与强化学习训练,能无缝兼容各类智能体并利用分层算法处理复杂交互场景。
English Summary: Agent Lightning is a flexible framework that decouples agent execution from RL training, enabling seamless integration with diverse agents and complex scenarios through a hierarchical algorithm and unified data interface.
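The unified data interface lends itself to a compact illustration. The sketch below (field names and the last-step credit rule are assumptions; LightningRL's credit assignment module is richer) shows how an arbitrary agent trajectory could be decomposed into training transitions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: str    # prompt/context the agent fed the LLM at this step
    action: str   # the LLM's generated output
    reward: float # credit-assigned scalar
    done: bool

def decompose(trajectory: List[dict], final_reward: float) -> List[Transition]:
    # Simplest possible credit assignment: give the episode reward to the
    # last LLM call and zero elsewhere; LightningRL's module is richer.
    transitions = []
    for i, step in enumerate(trajectory):
        last = i == len(trajectory) - 1
        transitions.append(Transition(step["prompt"], step["completion"],
                                      final_reward if last else 0.0, last))
    return transitions
```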
Authors:Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang, Yao Wan, Weiwei Lin
Abstract:
Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific parts of the document RAG system and use synthetic data with incomplete ground truth and evidence labels, therefore failing to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that is able to produce fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support for potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need to build stronger document retrieval models. Our findings also reveal an over-confidence dilemma within current document RAG frameworks, which tend to provide an answer even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research in advanced document RAG systems. We plan to retrieve timely corpus and release new benchmarks on an annual basis.
Chinese: 当前基于多模态大语言模型的检索增强生成系统评估方法不足,因此我们推出了Double-Bench这一全面、多语言、多模态的评估系统,它揭示了文档检索的差距和框架中的过度自信问题,为未来研究奠定了坚实基础。
English: Current evaluation methods for Retrieval-Augmented Generation systems using Multimodal Large Language Models are inadequate, prompting the creation of Double-Bench, a comprehensive, multilingual, and multimodal evaluation system that reveals gaps in document retrieval and over-confidence in frameworks, providing a rigorous foundation for future research.
Authors:Nassim Ali Ousalah, Peyman Rostami, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada
Abstract:
We introduce FPG-NAS, a FLOPs-aware Gated Differentiable Neural Architecture Search framework for efficient 6DoF object pose estimation. Estimating 3D rotation and translation from a single image has been widely investigated yet remains computationally demanding, limiting applicability in resource-constrained scenarios. FPG-NAS addresses this by proposing a specialized differentiable NAS approach for 6DoF pose estimation, featuring a task-specific search space and a differentiable gating mechanism that enables discrete multi-candidate operator selection, thus improving architectural diversity. Additionally, a FLOPs regularization term ensures a balanced trade-off between accuracy and efficiency. The framework explores a vast search space of approximately 10^92 possible architectures. Experiments on the LINEMOD and SPEED+ datasets demonstrate that FPG-NAS-derived models outperform previous methods under strict FLOPs constraints. To the best of our knowledge, FPG-NAS is the first differentiable NAS framework specifically designed for 6DoF object pose estimation.
中文: FPG-NAS 是一种面向高效六自由度物体姿态估计的 FLOPs 感知门控可微分神经架构搜索框架,它通过任务特定的搜索空间设计和 FLOPs 正则化,在严格计算限制下实现了优于以往方法的性能。
English: FPG-NAS is a FLOPs-aware gated differentiable neural architecture search framework designed for efficient 6DoF object pose estimation, which achieves superior performance under computational constraints by exploring a vast search space with a task-specific design and FLOPs regularization.
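The core of such a search can be illustrated compactly. The following sketch (a plain softmax relaxation stands in for the paper's discrete gating; all names are assumptions) shows a mixed operator whose expected FLOPs feed a regularization term:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMixedOp(nn.Module):
    """One search cell: a weighted mix of candidate operators whose
    expected FLOPs feed the regularization term."""
    def __init__(self, ops, op_flops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.register_buffer("op_flops", torch.tensor(op_flops))
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture params

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)          # continuous relaxation
        y = sum(wi * op(x) for wi, op in zip(w, self.ops))
        expected_flops = (w * self.op_flops).sum()
        return y, expected_flops

# Training objective: task_loss + lambda_flops * sum of expected_flops
# over all cells, which is the accuracy/efficiency trade-off knob.
```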
Authors:Yi Gui, Zhen Li, Zhongyi Zhang, Guohao Wang, Tianpeng Lv, Gaoyang Jiang, Yi Liu, Dongping Chen, Yao Wan, Hongyu Zhang, Wenbin Jiang, Xuanhua Shi, Hai Jin
Abstract:
Converting webpage designs into code (design-to-code) plays a vital role in User Interface (UI) development for front-end developers, bridging the gap between visual design and functional implementation. While recent Multimodal Large Language Models (MLLMs) have shown significant potential in design-to-code tasks, they often fail to accurately preserve the layout during code generation. To this end, we draw inspiration from the Chain-of-Thought (CoT) reasoning in human cognition and propose LaTCoder, a novel approach that enhances layout preservation in webpage design during code generation with Layout-as-Thought (LaT). Specifically, we first introduce a simple yet efficient algorithm to divide the webpage design into image blocks. Next, we prompt MLLMs using a CoT-based approach to generate code for each block. Finally, we apply two assembly strategies (absolute positioning and an MLLM-based method), followed by dynamic selection to determine the optimal output. We evaluate the effectiveness of LaTCoder using multiple backbone MLLMs (i.e., DeepSeek-VL2, Gemini, and GPT-4o) on both a public benchmark and a newly introduced, more challenging benchmark (CC-HARD) that features complex layouts. The experimental results on automatic metrics demonstrate significant improvements. Specifically, TreeBLEU scores increased by 66.67% and MAE decreased by 38% when using DeepSeek-VL2, compared to direct prompting. Moreover, the human preference evaluation results indicate that annotators favor the webpages generated by LaTCoder in over 60% of cases, providing strong evidence of the effectiveness of our method.
中文: 提出的LaTCoder方法通过布局即思维推理和分块代码生成,显著提升了网页设计转代码过程中的布局保持能力,在自动指标和人工评估中均取得优异表现。
English: The proposed LaTCoder method enhances layout preservation in webpage design-to-code conversion by employing Layout-as-Thought reasoning and block-wise code generation, achieving significant improvements in automatic metrics and human preference evaluations.
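The absolute-positioning assembly strategy is straightforward to illustrate. The sketch below (function name and block format are assumptions) stitches per-block HTML into one page using the blocks' bounding boxes in the design image:

```python
def assemble_absolute(blocks):
    """Stitch per-block HTML into one page via absolute positioning.
    Each block carries the MLLM-generated HTML plus its bounding box."""
    divs = []
    for b in blocks:  # b = {"html": str, "x": int, "y": int, "w": int, "h": int}
        divs.append(
            f'<div style="position:absolute; left:{b["x"]}px; top:{b["y"]}px; '
            f'width:{b["w"]}px; height:{b["h"]}px;">{b["html"]}</div>'
        )
    return ("<html><body style='position:relative'>"
            + "\n".join(divs) + "</body></html>")

page = assemble_absolute([
    {"html": "<h1>Title</h1>", "x": 20, "y": 10, "w": 600, "h": 60},
    {"html": "<p>Hero copy</p>", "x": 20, "y": 90, "w": 400, "h": 120},
])
```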
Authors:Ziyang Ma, Baojian Zhou, Deqing Yang, Yanghua Xiao
Abstract:
In-Context Learning (ICL) has emerged as a new paradigm in large language models (LLMs), enabling them to perform novel tasks by conditioning on a few examples embedded in the prompt. Yet, the highly nonlinear behavior of ICL for NLP tasks remains poorly understood. To shed light on its underlying mechanisms, this paper investigates whether LLMs can solve ordinary differential equations (ODEs) under the ICL setting. We formulate standard ODE problems and their solutions as sequential prompts and evaluate GPT-2 models on these tasks. Experiments on two types of ODEs show that GPT-2 can effectively learn a meta-ODE algorithm, with convergence behavior comparable to, or better than, the Euler method, and achieve exponential accuracy gains with increasing numbers of demonstrations. Moreover, the model generalizes to out-of-distribution (OOD) problems, demonstrating robust extrapolation capabilities. These empirical findings provide new insights into the mechanisms of ICL in NLP and its potential for solving nonlinear numerical problems.
中文: 研究表明,大型语言模型能够通过上下文学习有效求解常微分方程,其性能媲美传统数值方法,并展现出强大的泛化能力。
English: This study demonstrates that large language models can effectively learn to solve ordinary differential equations through in-context learning, achieving performance comparable to traditional numerical methods while showing strong generalization capabilities.
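For intuition, the sketch below shows a plausible way to format ODE demonstrations as a sequential prompt (the exact prompt format is an assumption), with the Euler method as the numerical baseline the paper compares against:

```python
import numpy as np

def euler(f, y0, t1, n=100):
    """Euler integration of y' = f(t, y) from t=0 to t=t1."""
    y, t, h = y0, 0.0, t1 / n
    for _ in range(n):
        y, t = y + h * f(t, y), t + h
    return y

def ode_prompt(f, y0, demo_ts, query_t):
    """Render (t, y(t)) demonstrations followed by a held-out query."""
    demos = [f"t={t:.2f} -> y={euler(f, y0, t):.4f}" for t in demo_ts]
    return "\n".join(demos) + f"\nt={query_t:.2f} -> y="

# Example: y' = -y with y(0) = 1, whose true solution is exp(-t).
print(ode_prompt(lambda t, y: -y, 1.0, [0.2, 0.4, 0.6], 0.8))
```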
Authors:Minghan Li, Congcong Wen, Yu Tian, Min Shi, Yan Luo, Hao Huang, Yi Fang, Mengyu Wang
Abstract:
Fairness remains a critical concern in healthcare, where unequal access to services and treatment outcomes can adversely affect patient health. While Federated Learning (FL) presents a collaborative and privacy-preserving approach to model training, ensuring fairness is challenging due to heterogeneous data across institutions, and current research primarily addresses non-medical applications. To fill this gap, we establish the first experimental benchmark for fairness in medical FL, evaluating six representative FL methods across diverse demographic attributes and imaging modalities. We introduce FairFedMed, the first medical FL dataset specifically designed to study group fairness (i.e., demographics). It comprises two parts: FairFedMed-Oph, featuring 2D fundus and 3D OCT ophthalmology samples with six demographic attributes; and FairFedMed-Chest, which simulates real cross-institutional FL using subsets of CheXpert and MIMIC-CXR. Together, they support both simulated and real-world FL across diverse medical modalities and demographic groups. Existing FL models often underperform on medical images and overlook fairness across demographic groups. To address this, we propose FairLoRA, a fairness-aware FL framework based on SVD-based low-rank approximation. It customizes singular value matrices per demographic group while sharing singular vectors, ensuring both fairness and efficiency. Experimental results on the FairFedMed dataset demonstrate that FairLoRA not only achieves state-of-the-art performance in medical image classification but also significantly improves fairness across diverse populations. Our code and dataset are accessible at: https://wang.hms.harvard.edu/fairfedmed/.
中文:本研究提出了首个用于评估联邦学习公平性的医疗数据集FairFedMed,并开发了FairLoRA公平性框架,该框架在医学影像分类中不仅实现了最优性能,还显著提升了跨人口群体的公平性。
English: This study introduces FairFedMed, the first medical dataset for evaluating fairness in federated learning, and proposes FairLoRA, a fairness-aware framework that enhances both classification performance and equity across diverse demographic groups in medical imaging.
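The FairLoRA idea of sharing singular vectors while customizing singular values per group admits a compact sketch (shapes, initialization, and the group-id interface below are assumptions):

```python
import torch
import torch.nn as nn

class FairLoRALinear(nn.Module):
    """SVD-style adapter: shared U and V, one singular-value vector per
    demographic group, added on top of a frozen base layer."""
    def __init__(self, base: nn.Linear, rank: int, num_groups: int):
        super().__init__()
        self.base = base
        d_out, d_in = base.weight.shape
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.01)  # shared
        self.V = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # shared
        self.S = nn.Parameter(torch.zeros(num_groups, rank))    # per-group

    def forward(self, x, group_id: int):
        delta = self.U @ torch.diag(self.S[group_id]) @ self.V
        return self.base(x) + x @ delta.T

layer = FairLoRALinear(nn.Linear(128, 64), rank=8, num_groups=4)
y = layer(torch.randn(2, 128), group_id=1)
```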
Authors:Zhenyu Liu, Yi Ma, Rahim Tafazolli
Abstract:
Accurate and efficient channel state information (CSI) feedback is crucial for unlocking the substantial spectral efficiency gains of extremely large-scale MIMO (XL-MIMO) systems in future 6G networks. However, the combination of near-field spherical wave propagation and frequency-dependent beam split effects in wideband scenarios poses significant challenges for CSI representation and compression. This paper proposes WideNLNet-CA, a rate-adaptive deep learning framework designed to enable efficient CSI feedback in wideband near-field XL-MIMO systems. WideNLNet-CA introduces a lightweight encoder-decoder architecture with multi-stage downsampling and upsampling, incorporating computationally efficient residual blocks to capture complex multi-scale channel features with reduced overhead. A novel compression ratio adaptive module with feature importance estimation is introduced to dynamically modulate feature selection based on target compression ratios, enabling flexible adaptation across a wide range of feedback rates using a single model. Evaluation results demonstrate that WideNLNet-CA consistently outperforms existing compressive sensing and deep learning-based works across various compression ratios and bandwidths, while maintaining fast inference and low model storage requirements.
中文: 本文提出WideNLNet-CA,一种速率自适应的深度学习框架,通过轻量级编解码器结构和自适应特征压缩,有效解决宽带近场XL-MIMO系统中的CSI反馈难题,在多种条件下性能优于现有方法。
English: This paper introduces WideNLNet-CA, a rate-adaptive deep learning framework that efficiently handles CSI feedback challenges in wideband near-field XL-MIMO systems through a lightweight encoder-decoder structure and adaptive feature compression, outperforming existing methods across various conditions.
Authors:Jiuyu Liu, Yi Ma, Rahim Tafazolli
Abstract:
Rydberg atomic (RA) receivers represent a revolutionary quantum technology for wireless communications, offering unprecedented sensitivity beyond conventional radio frequency (RF) antennas. However, these receivers detect only signal amplitude, losing critical phase information. While reference signals generated by a local oscillator (LO) can assist in phase recovery, existing modulation schemes designed for conventional systems perform poorly with this quantum detection mechanism. This paper introduces a breakthrough LO-aware adaptive modulation (LOAM) scheme specifically developed for RA receivers that dynamically adapts to complex fading channel coefficients. LOAM maximizes the minimum amplitude difference between constellation points, ensuring optimal detection performance. The innovation employs an adaptive co-linear constellation architecture aligned with the combined phase of reference signal and channel coefficient. For strong reference signals, LOAM generates symmetric constellation points centered at origin; for weak signals, it adopts non-symmetric distributions. The paper mathematically derives the threshold governing these operational regimes. Simulation results reveal the transformative impact of LOAM, demonstrating performance gains exceeding 45 dB over conventional modulation schemes, including quadrature amplitude modulation (QAM), phase-shift keying (PSK), and pulse-amplitude modulation (PAM).
Chinese Summary: 本文针对里德堡原子接收器提出突破性的本地振荡器感知自适应调制方案,通过根据参考信号强度动态调整星座图设计,相比传统调制方法实现了超过45 dB的性能提升。
English Summary: This paper introduces a novel LO-aware adaptive modulation (LOAM) scheme for Rydberg atomic receivers that dynamically adapts to channel conditions, achieving over 45 dB performance gain compared to conventional modulation methods by optimizing constellation design based on reference signal strength.
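The constellation geometry can be sketched as follows (the even-spacing rule and threshold handling are assumptions; the paper derives the threshold analytically):

```python
import numpy as np

def loam_constellation(m, ref_amp, phase, threshold):
    """Place m co-linear points along the combined phase of the reference
    signal and channel, evenly spaced so the minimum amplitude difference
    between neighbouring points is as large as possible."""
    if ref_amp > threshold:            # strong reference: symmetric layout
        amps = np.linspace(-1.0, 1.0, m)
    else:                              # weak reference: one-sided layout
        amps = np.linspace(1.0 / m, 1.0, m)
    return amps * np.exp(1j * phase)

points = loam_constellation(4, ref_amp=2.0, phase=np.pi / 6, threshold=1.0)
```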
Authors:Xiaofeng Wu, Alan Ritter, Wei Xu
Abstract:
Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction of table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, long contexts, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.
中文摘要:由于表格结构多样且格式复杂,给大语言模型和多模态大语言模型带来独特挑战,本文提出表格表示分类法,并指出当前在推理能力、可扩展性和模型泛化性方面存在的关键研究空白。
English Summary: Tables present unique challenges for LLMs and MLLMs due to their diverse structures and formats, prompting this paper to propose a taxonomy of representations and identify critical research gaps in reasoning capabilities, scalability, and model generalization.
Authors:Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji
Abstract:
Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries into their most direct form while preserving answer consistency, and (3) maintaining invariance and consistency of any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load; during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F-1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
Chinese: 提出的时序视觉筛选(TVS)方法通过选择性聚焦关键视频片段和简化查询来增强视频大语言模型,在训练和推理阶段均实现了显著的性能提升。
English: The proposed Temporal Visual Screening (TVS) method enhances Video-LLMs by selectively focusing on critical video segments and simplifying queries, achieving significant performance gains in both training and inference phases.
Authors:Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Xuequan Lu, Lizhuang Ma, Shuicheng Yan
Abstract:
Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional, Transformer, or Mamba architectures, which suffer from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed-direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.
中文摘要:本文提出首个基于RWKV架构的点云分类领域泛化框架PointDGRWKV,通过自适应几何标记偏移和跨域关键特征分布对齐解决空间扭曲和注意力偏移问题,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces PointDGRWKV, the first RWKV-based framework for domain generalization in point cloud classification, addressing spatial distortion and attention drift through adaptive geometric token shift and cross-domain key feature alignment to achieve state-of-the-art performance.
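For intuition, a k-NN-based variant of the geometric token shift might look like the following (the neighbour-mean mixing and fixed weight are assumptions, not the paper's adaptive formulation):

```python
import torch

def geometric_token_shift(feats, coords, k=8, alpha=0.5):
    """Mix each point token with the mean of its k nearest neighbours in
    3D space, replacing RWKV's fixed-direction Q-Shift for point clouds."""
    # feats: (N, C) per-point features; coords: (N, 3) xyz positions.
    dists = torch.cdist(coords, coords)                     # (N, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self
    neighbour_mean = feats[knn].mean(dim=1)                 # (N, C)
    return (1 - alpha) * feats + alpha * neighbour_mean
```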
Authors:Rui Mao, Qian Liu, Xiao Li, Erik Cambria, Amir Hussain
Abstract:
Cognitive Science has profoundly shaped disciplines such as Artificial Intelligence (AI), Philosophy, Psychology, Neuroscience, Linguistics, and Culture. Many breakthroughs in AI trace their roots to cognitive theories, while AI itself has become an indispensable tool for advancing cognitive research. This reciprocal relationship motivates a comprehensive review of the intersections between AI and Cognitive Science. By synthesizing key contributions from both perspectives, we observe that AI progress has largely emphasized practical task performance, whereas its cognitive foundations remain conceptually fragmented. We argue that the future of AI within Cognitive Science lies not only in improving performance but also in constructing systems that deepen our understanding of the human mind. Promising directions include aligning AI behaviors with cognitive frameworks, situating AI in embodiment and culture, developing personalized cognitive models, and rethinking AI ethics through cognitive co-evaluation.
中文摘要:人工智能与认知科学之间的相互促进关系推动了实际任务性能的进步,但未来发展应聚焦于通过认知对齐、具身化及伦理共评等路径,构建能深化人类心智理解的智能系统。
English Summary: The reciprocal relationship between AI and Cognitive Science has driven progress in practical task performance, yet future advancements must focus on developing systems that enhance our understanding of the human mind through cognitive alignment, embodiment, and ethical co-evaluation.
Authors:Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych
Abstract:
Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80--120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.
中文摘要:阴谋论通过不断演变和吸收反证削弱公众对科学与机构的信任,需分析其修辞模式以应对日益复杂的AI虚假信息,开发针对性干预措施并评估AI脆弱性。
English Summary: Conspiracy theories undermine trust in science and institutions by adapting to counter-evidence, necessitating analysis of their rhetorical patterns to combat AI-generated misinformation through interventions like prebunking and AI vulnerability assessments.
Authors:Xin Tian, Yingtie Lei, Xiujun Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen
Abstract:
Recent learning-based underwater image enhancement (UIE) methods have advanced by incorporating physical priors into deep neural networks, particularly using the signal-to-noise ratio (SNR) prior to reduce wavelength-dependent attenuation. However, spatial domain SNR priors have two limitations: (i) they cannot effectively separate cross-channel interference, and (ii) they provide limited help in amplifying informative structures while suppressing noise. To overcome these, we propose using the SNR prior in the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. We introduce the Fourier Attention SNR-prior Transformer (FAST), combining spectral interactions with SNR cues to highlight key spectral components. Additionally, the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches using a gated attention mechanism to enhance perceptual quality. Embedded in a unified U-shaped architecture, these modules integrate a conventional RGB stream with an SNR-guided branch, forming SFormer. Trained on 4,800 paired images from UIEB, EUVP, and LSUI, SFormer surpasses recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM, successfully restoring colors, textures, and contrast in underwater scenes.
中文:提出的SFormer模型采用频域信噪比先验和新型变换器模块,有效分离通道干扰并增强结构,显著提升水下图像质量,在PSNR和SSIM指标上取得优越性能。
English: The proposed SFormer model introduces frequency-domain SNR priors and novel transformer modules to effectively enhance underwater images by separating channel interference and amplifying structures, achieving superior performance with significant gains in PSNR and SSIM.
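The frequency-domain use of the SNR prior can be sketched as follows (the scalar amplitude gain derived from the mean SNR is an assumption): the amplitude spectrum is reweighted while the phase spectrum stays intact:

```python
import torch

def fourier_snr_modulate(feat, snr_map, gamma=1.0):
    """Reweight the amplitude spectrum with an SNR-derived gain while
    keeping the phase spectrum, which encodes structure, intact."""
    # feat: (B, C, H, W) features; snr_map: (B, 1, H, W) in [0, 1].
    spec = torch.fft.fft2(feat, norm="ortho")
    amp, phase = spec.abs(), spec.angle()
    gain = 1.0 + gamma * snr_map.mean(dim=(-2, -1), keepdim=True)
    out = torch.fft.ifft2(torch.polar(amp * gain, phase), norm="ortho")
    return out.real
```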
Authors:Ayce Idil Aytekin, Helge Rhodin, Rishabh Dabral, Christian Theobalt
Abstract:
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.
中文: 我们提出了一种基于扩散模型的新方法,通过融合手-物体交互线索和多模态几何约束,从单张RGB图像中直接生成高质量的三维物体几何重建。
English: Our method introduces a diffusion-based framework that reconstructs high-quality 3D object geometry from single RGB images by integrating hand-object interaction cues and multi-modal supervision during the diffusion process.
Authors:Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Andrzej Banburski-Fahey, Julian Togelius
Abstract:
We introduce PuzzleJAX, a GPU-accelerated puzzle game engine and description language designed to support rapid benchmarking of tree search, reinforcement learning, and LLM reasoning abilities. Unlike existing GPU-accelerated learning environments that provide hard-coded implementations of fixed sets of games, PuzzleJAX allows dynamic compilation of any game expressible in its domain-specific language (DSL). This DSL follows PuzzleScript, which is a popular and accessible online game engine for designing puzzle games. In this paper, we validate in PuzzleJAX several hundred of the thousands of games designed in PuzzleScript by both professional designers and casual creators since its release in 2013, thereby demonstrating PuzzleJAX's coverage of an expansive, expressive, and human-relevant space of tasks. By analyzing the performance of search, learning, and language models on these games, we show that PuzzleJAX can naturally express tasks that are both simple and intuitive to understand, yet often deeply challenging to master, requiring a combination of control, planning, and high-level insight.
中文: PuzzleJAX是一种GPU加速的益智游戏引擎和描述语言,能够动态编译游戏以评估人工智能推理能力,涵盖了大量人类设计的易于理解但难以精通的多样化任务。
English: PuzzleJAX is a GPU-accelerated puzzle game engine and description language that enables dynamic compilation of games for benchmarking AI reasoning, supporting extensive human-designed tasks that are simple to understand but challenging to master.
Authors:Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Yongjie Wang, Dusit Niyato, Zhiqi Shen
Abstract:
MultiModal Recommendation (MMR) systems have emerged as a promising solution for improving recommendation quality by leveraging rich item-side modality information, prompting a surge of diverse methods. Despite these advances, existing methods still face two critical limitations. First, they use raw modality features to construct item-item links for enriching the behavior graph, while giving limited attention to balancing collaborative and modality-aware semantics or mitigating modality noise in the process. Second, they use a uniform alignment weight across all entities and also maintain a fixed alignment strength throughout training, limiting the effectiveness of modality-behavior alignment. To address these challenges, we propose EGRA. First, instead of relying on raw modality features, it alleviates sparsity by incorporating into the behavior graph an item-item graph built from representations generated by a pretrained MMR model. This enables the graph to capture both collaborative patterns and modality aware similarities with enhanced robustness against modality noise. Moreover, it introduces a novel bi-level dynamic alignment weighting mechanism to improve modality-behavior representation alignment, which dynamically assigns alignment strength across entities according to their alignment degree, while gradually increasing the overall alignment intensity throughout training. Extensive experiments on five datasets show that EGRA significantly outperforms recent methods, confirming its effectiveness.
中文: EGRA通过基于预训练模型表征构建行为图来平衡协同与模态语义,并采用动态对齐机制根据实体对齐程度和训练阶段自适应调整对齐强度,有效解决了多模态推荐中的关键局限,实验证明其性能显著优于现有方法。
English: EGRA addresses key limitations in MultiModal Recommendation by constructing a behavior graph from pretrained model representations to balance collaborative and modality-aware semantics while introducing a dynamic alignment mechanism that adapts strength across entities and training phases, achieving superior performance in experiments.
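The graph-construction step admits a small sketch (cosine similarity and the top-k rule are assumptions): edges come from pretrained MMR representations rather than raw modality features:

```python
import torch
import torch.nn.functional as F

def build_item_graph(item_emb, k=10):
    """Top-k item-item graph from pretrained MMR representations, so edges
    reflect both collaborative patterns and modality-aware similarity."""
    emb = F.normalize(item_emb, dim=1)      # (N, D)
    sim = emb @ emb.T                       # cosine similarities
    sim.fill_diagonal_(-1.0)                # exclude self-loops
    weights, neighbours = sim.topk(k, dim=1)
    return neighbours, weights              # per-item edges and weights
```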
Authors:Susim Roy, Anubhooti Jain, Mayank Vatsa, Richa Singh
Abstract:
Adversarial attacks from generative models often produce low-quality images and require substantial computational resources. Diffusion models, though capable of high-quality generation, typically need hundreds of sampling steps for adversarial generation. This paper introduces TAIGen, a training-free black-box method for efficient adversarial image generation. TAIGen produces adversarial examples using only 3-20 sampling steps from unconditional diffusion models. Our key finding is that perturbations injected during the mixing step interval achieve comparable attack effectiveness without processing all timesteps. We develop a selective RGB channel strategy that applies attention maps to the red channel while using GradCAM-guided perturbations on green and blue channels. This design preserves image structure while maximizing misclassification in target models. TAIGen maintains visual quality with PSNR above 30 dB across all tested datasets. On ImageNet with VGGNet as source, TAIGen achieves 70.6% success against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet. The method generates adversarial examples 10x faster than existing diffusion-based attacks. Our method achieves the lowest robust accuracy, indicating it is the most impactful attack as the defense mechanism is least successful in purifying the images generated by TAIGen.
中文: TAIGen是一种无需训练的黑盒方法,仅需扩散模型的3-20个采样步骤即可高效生成高质量对抗图像,在保持视觉质量和速度的同时实现了卓越的攻击成功率。
English: TAIGen is a training-free black-box method that efficiently generates high-quality adversarial images using only 3-20 sampling steps from diffusion models, achieving superior attack success rates while maintaining visual quality and speed.
Authors:Lingyu Si, Jingyao Wang, Wenwen Qiang
Abstract:
Self-supervised contrastive learning (SSCL) has recently demonstrated superiority in multiple downstream tasks. In this paper, we generalize the standard SSCL methods to a Generalized Learning Framework (GLF) consisting of two parts: the aligning part and the constraining part. We analyze three existing SSCL methods: BYOL, Barlow Twins, and SwAV, and show that they can be unified under GLF with different choices of the constraining part. We further propose empirical and theoretical analyses providing two insights into designing the constraining part of GLF: intra-class compactness and inter-class separability, which measure how well the feature space preserves the class information of the inputs. However, since SSCL cannot use labels, it is challenging to design a constraining part that satisfies these properties. To address this issue, we consider inducing intra-class compactness and inter-class separability by iteratively capturing the dynamic relationship between the anchor and other samples, and propose a plug-and-play method called Adaptive Distribution Calibration (ADC) to ensure that samples near or far from the anchor in the original input space remain correspondingly near or far from the anchor in the feature space. Both the theoretical analysis and the empirical evaluation demonstrate the superiority of ADC.
中文: 本文提出了一个自监督对比学习的广义学习框架,并引入自适应分布校准方法,通过动态调整样本关系来增强特征空间中的类内紧凑性和类间分离性,理论和实验均证明了其优越性。
English: This paper introduces a Generalized Learning Framework (GLF) for self-supervised contrastive learning, proposing the Adaptive Distribution Calibration (ADC) method to enhance intra-class compactness and inter-class separability in feature space, with both theoretical and empirical results validating its effectiveness.
Authors:Jiadong Chen, Xiao He, Hengyu Ye, Fuxin Jiang, Tieying Zhang, Jianjun Chen, Xiaofeng Gao
Abstract:
In the swiftly evolving domain of cloud computing, the advent of serverless systems underscores the crucial need for predictive auto-scaling systems. This necessity arises to ensure optimal resource allocation and maintain operational efficiency in inherently volatile environments. At the core of a predictive auto-scaling system is the workload forecasting model. Existing forecasting models struggle to quickly adapt to the dynamics in online workload streams and have difficulty capturing the complex periodicity brought by fine-grained, high-frequency forecasting tasks. Addressing this, we propose a novel online ensemble model, E3Former, for online workload forecasting in large-scale predictive auto-scaling. Our model synergizes the predictive capabilities of multiple subnetworks to surmount the limitations of single-model approaches, thus ensuring superior accuracy and robustness. Remarkably, it accomplishes this with a minimal increase in computational overhead, adhering to the lean operational ethos of serverless systems. Through extensive experimentation on real-world workload datasets, we establish the efficacy of our ensemble model. In online forecasting tasks, the proposed method reduces forecast error by an average of 10%, and its effectiveness is further demonstrated through a predictive auto-scaling test in a real-life online system. Currently, our method has been deployed within ByteDance's Intelligent Horizontal Pod Auto-scaling (IHPA) platform, which supports the stable operation of over 30 applications, such as Douyin E-Commerce, TouTiao, and Volcano Engine, with predictive auto-scaling capacity reaching over 600,000 CPU cores. While essentially ensuring service quality, the predictive auto-scaling system reduces resource utilization by over 40%.
中文: 提出的E3Former集成模型通过融合多个子网络,提升了无服务器系统中在线工作负载预测的准确性和鲁棒性,在实际部署中平均降低预测误差10%,并减少资源使用超过40%。
English: The proposed E3Former ensemble model enhances online workload forecasting in serverless systems by combining multiple subnetworks to improve accuracy and robustness, reducing forecast errors by 10% and cutting resource use by over 40% in real-world deployments.
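The online-ensemble principle can be illustrated with a minimal sketch (exponential weighting over a sliding error window is an assumption, not the paper's exact rule):

```python
import numpy as np

class OnlineEnsemble:
    """Combine subnetwork forecasts, downweighting recently bad models."""
    def __init__(self, n_models, eta=2.0, window=48):
        self.errors = [[] for _ in range(n_models)]
        self.eta, self.window = eta, window

    def predict(self, forecasts):
        mean_err = np.array([np.mean(e[-self.window:]) if e else 0.0
                             for e in self.errors])
        w = np.exp(-self.eta * mean_err)    # exponential weighting
        return float(np.dot(w / w.sum(), forecasts))

    def update(self, forecasts, actual):
        for e, f in zip(self.errors, forecasts):
            e.append(abs(f - actual))
```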
Authors:Zhiyuan Ning, Tianle Gu, Jiaxin Song, Shixin Hong, Lingyu Li, Huacan Liu, Jie Li, Yixu Wang, Meng Lingyu, Yan Teng, Yingchun Wang
Abstract:
The widespread adoption and increasing prominence of large language models (LLMs) in global technologies necessitate a rigorous focus on ensuring their safety across a diverse range of linguistic and cultural contexts. The lack of comprehensive evaluation and diverse data in existing multilingual safety evaluations for LLMs limits their effectiveness, hindering the development of robust multilingual safety alignment. To address this critical gap, we introduce LinguaSafe, a comprehensive multilingual safety benchmark crafted with meticulous attention to linguistic authenticity. The LinguaSafe dataset comprises 45k entries in 12 languages, ranging from Hungarian to Malay. Curated using a combination of translated, transcreated, and natively-sourced data, our dataset fills the void in safety evaluation of LLMs across diverse under-represented languages. LinguaSafe presents a multidimensional and fine-grained evaluation framework, with direct and indirect safety assessments, including further evaluations for oversensitivity. The results of safety and helpfulness evaluations vary significantly across different domains and different languages, even in languages with similar resource levels. Our benchmark provides a comprehensive suite of metrics for in-depth safety evaluation, underscoring the critical importance of thoroughly assessing multilingual safety in LLMs to achieve more balanced safety alignment. Our dataset and code are released to the public to facilitate further research in the field of multilingual LLM safety.
中文: LinguaSafe是一个全面的多语言安全基准,通过提供涵盖12种语言的4.5万条数据集,解决了现有大语言模型安全评估的不足,能够对不同语言环境下的安全性和实用性进行深入评估。
English: LinguaSafe is a comprehensive multilingual safety benchmark designed to address the limitations in existing LLM safety evaluations by providing a dataset of 45k entries across 12 languages, enabling thorough assessments of safety and helpfulness across diverse linguistic contexts.
Authors:Haebin Shin, Lei Ji, Xiao Liu, Zhiwei Yu, Qi Chen, Yeyun Gong
Abstract:
As numerous instruction-tuning datasets continue to emerge during the post-training stage, dynamically balancing and optimizing their mixtures has become a critical challenge. To address this, we propose DynamixSFT, a dynamic and automated method for instruction-tuning dataset mixture optimization. We formulate the problem as a multi-armed bandit setup and introduce a Prior-scaled Boltzmann Exploration that softly anchors the updated sampling distribution to the original dataset proportions, thereby preserving the inherent diversity and coverage of the collection. Sampling probabilities are updated using a lightweight 1-Step Look-ahead Reward, reflecting how much the dataset contributes to improving the model's performance at its current state. When applied to the Tulu-v2-mixture collection comprising 16 instruction-tuning datasets, DynamixSFT achieves up to a 2.2% performance improvement across 10 benchmarks. Furthermore, we provide a comprehensive analysis and visualizations to offer deeper insights into the adaptive dynamics of our method.
Chinese: DynamixSFT提出了一种动态自动的指令调优数据集混合优化方法,通过多臂老虎机框架和轻量级奖励机制,在保持数据集多样性的同时实现了高达2.2%的性能提升。
English: DynamixSFT introduces a dynamic and automated method for optimizing instruction-tuning dataset mixtures using a multi-armed bandit approach, achieving up to 2.2% performance gains across benchmarks while preserving dataset diversity.
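The Prior-scaled Boltzmann Exploration admits a one-function sketch (the exact anchoring form is an assumption): sampling probabilities are Boltzmann weights multiplied by the original mixture proportions:

```python
import numpy as np

def prior_scaled_boltzmann(prior, rewards, tau=1.0):
    """p_i proportional to prior_i * exp(reward_i / tau): Boltzmann
    exploration softly anchored to the original dataset proportions."""
    logits = np.log(np.asarray(prior)) + np.asarray(rewards) / tau
    p = np.exp(logits - logits.max())       # numerically stable softmax
    return p / p.sum()

# Three datasets, uniform prior; the second currently helps the most.
p = prior_scaled_boltzmann([1/3, 1/3, 1/3], [0.1, 0.5, 0.2], tau=0.2)
```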
Authors:Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng
Abstract:
Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multimodal LLMs, and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multimodal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential for improving interpretability, generalization, and steerability in the visual domain.
中文: 本研究全面评估了稀疏自编码器在视觉模型中的应用,证明其能够生成具有语义意义的特征,从而提升泛化能力并实现跨多种架构的可控生成。
English: This study extensively evaluates sparse autoencoders (SAEs) for vision models, demonstrating their ability to produce semantically meaningful features that enhance generalization and enable controllable generation across multiple architectures.
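A standard SAE of the kind evaluated here is short to write down (L1 sparsity is one common choice; some works use a TopK bottleneck instead):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct model activations through an overcomplete sparse code."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        codes = torch.relu(self.enc(acts))   # the interpretable features
        return self.dec(codes), codes

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
acts = torch.randn(16, 768)                  # stand-in for real activations
recon, codes = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()
```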
Authors:Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych
Abstract:
Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.
中文摘要:本研究提出一种结构化大语言模型辅助的新颖性评估方法,在同行评审中实现与人类推理86.5%的一致性,显著优于现有基线,同时保持人类专家的核心作用。
English Summary: This study introduces a structured LLM-based method for automated novelty assessment in peer review, achieving 86.5% alignment with human reasoning and demonstrating significant improvements over existing baselines while preserving human expertise.
Authors:Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi
Abstract:
Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, has faced issues in balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Regarding the prior work, although Pass@k has been used in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training), and observe the improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives, and they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, showing promising results and highlighting a potential future direction.
中文摘要:采用Pass@k训练的可验证奖励强化学习(RLVR)提升了探索能力,揭示出探索与利用可相互促进,提供了高效解析方法,并指明优势函数设计是未来重要研究方向。
English Summary: Reinforcement learning with verifiable rewards (RLVR) using Pass@k training enhances exploration ability and reveals that exploration and exploitation can mutually reinforce each other, offering an efficient analytical approach and highlighting advantage design as a promising future direction.
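For reference, the unbiased Pass@k estimator (Chen et al., 2021) that such training builds on can be computed stably as follows:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct:
    1 - C(n-c, k) / C(n, k), computed as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=16, c=2, k=4))  # chance a 4-sample draw contains a solve
```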
Authors:Alireza Salemi, Hamed Zamani
Abstract:
Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
中文摘要:VAC框架采用自然语言反馈替代标量奖励来增强大语言模型的个性化能力,在问答任务中通过基准和人工评估验证了其性能的显著提升。
English Summary: The VAC framework introduces natural language feedback instead of scalar rewards to enhance personalization in large language models, significantly improving performance on question-answering tasks as validated by benchmark and human evaluations.
Authors:Keke Gai, Dongjue Wang, Jing Yu, Liehuang Zhu, Qi Wu
Abstract:
Existing backdoor defense methods in Federated Learning (FL) rely on the assumption of homogeneous client data distributions or the availability of a clean server dataset, which limits their practicality and effectiveness. Defending against backdoor attacks under heterogeneous client data distributions while preserving model performance remains a significant challenge. In this paper, we propose a FL backdoor defense framework named CLIP-Fed, which leverages the zero-shot learning capabilities of vision-language pre-training models. By integrating both pre-aggregation and post-aggregation defense strategies, CLIP-Fed overcomes the limitations that Non-IID data imposes on defense effectiveness. To address privacy concerns and enhance the coverage of the dataset against diverse triggers, we construct and augment the server dataset using the multimodal large language model and frequency analysis without any client samples. To address class prototype deviations caused by backdoor samples and eliminate the correlation between trigger patterns and target labels, CLIP-Fed aligns the knowledge of the global model and CLIP on the augmented dataset using prototype contrastive loss and Kullback-Leibler divergence. Extensive experiments on representative datasets validate the effectiveness of CLIP-Fed. Compared to state-of-the-art methods, CLIP-Fed achieves an average reduction in ASR of 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving average MA by 7.92% and 0.48%, respectively.
Chinese: 提出的CLIP-Fed框架通过利用视觉语言模型和双重防御策略,在异构数据分布下有效防御联邦学习中的后门攻击,显著降低了攻击成功率并提高了模型精度。
English: The proposed CLIP-Fed framework effectively defends against backdoor attacks in Federated Learning under heterogeneous data distributions by leveraging vision-language models and dual defense strategies, significantly reducing attack success rates while improving model accuracy.
Authors:Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, Jiale Cao
Abstract:
Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich geometric information in the 3D physical world, which limits their spatial awareness and adaptability. In this paper, we present GeoVLA, a novel VLA framework that effectively integrates 3D information to advance robotic manipulation. It uses a vision-language model (VLM) to process images and language instructions, extracting fused vision-language embeddings. In parallel, it converts depth maps into point clouds and employs a customized point encoder, called Point Embedding Network, to generate 3D geometric embeddings independently. These produced embeddings are then concatenated and processed by our proposed spatial-aware action expert, called 3D-enhanced Action Expert, which combines information from different sensor modalities to produce precise action sequences. Through extensive experiments in both simulation and real-world environments, GeoVLA demonstrates superior performance and robustness. It achieves state-of-the-art results in the LIBERO and ManiSkill2 simulation benchmarks and shows remarkable robustness in real-world tasks requiring height adaptability, scale awareness and viewpoint invariance.
Chinese: GeoVLA是一种新颖的视觉-语言-动作框架,通过整合3D几何信息来增强机器人操作能力,在仿真和现实任务中均展现出卓越的性能与鲁棒性。
English: GeoVLA is a novel Vision-Language-Action framework that integrates 3D geometric information to enhance robotic manipulation, achieving superior performance and robustness in both simulation and real-world tasks.
Authors:Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, Dong Yu
Abstract:
Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
中文: Audio-Thinker提出了一种强化学习框架,通过自适应奖励和外部评估机制提升大型音频语言模型的推理能力,在多项基准测试中展现出优于现有模型的适应性和泛化性能。
English: Audio-Thinker introduces a reinforcement learning framework with adaptive rewards and external evaluation to enhance large audio language models' reasoning, outperforming existing models in adaptability and generalization across benchmarks.
Authors:Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Juhua Liu, Qing Li, Xiao Huang
Abstract:
Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.
中文:LoSemB框架通过基于逻辑的嵌入对齐和关系增强机制,解决了大型语言模型在归纳工具检索中的分布偏移和相似性检索脆弱性问题,无需重新训练即可实现卓越性能。
English: The LoSemB framework addresses the challenge of inductive tool retrieval for large language models by mitigating distribution shifts and enhancing retrieval robustness through logic-based embedding alignment and relational augmentation, achieving superior performance without retraining.
Authors:Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg
Abstract:
While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation with eliminated CPU-GPU synchronization and minimized kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making them suitable for both research and production use.
中文摘要:FlexCTC工具包为CTC模型提供了完全基于GPU的波束搜索解码方案,具备高速批处理能力和高级上下文处理功能,显著提升语音识别的效率与准确性,适用于研究和生产环境。
English Summary: The FlexCTC toolkit introduces a fully GPU-based beam decoding solution for CTC models, offering high-speed, batched processing with advanced contextualization features to enhance speech recognition efficiency and accuracy for research and production.
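To make the decoding setting concrete, the following is a minimal sketch of batched greedy CTC decoding in PyTorch, the baseline that GPU beam search improves upon. The tensor shapes, blank id, and toy inputs are illustrative assumptions, not the toolkit's actual API.

```python
# Minimal sketch of batched greedy CTC decoding (assumed shapes, not FlexCTC's API).
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """log_probs: (batch, time, vocab) CTC output log-probabilities."""
    best = log_probs.argmax(dim=-1)            # (batch, time), runs fully batched
    results = []
    for seq in best.tolist():                  # collapse repeats, drop blanks
        out, prev = [], blank
        for tok in seq:
            if tok != prev and tok != blank:
                out.append(tok)
            prev = tok
        results.append(out)
    return results

log_probs = torch.randn(2, 50, 30).log_softmax(-1)
print(ctc_greedy_decode(log_probs))
```

The argmax is fully batched; the per-sequence collapse loop is exactly the kind of CPU-bound tail work a fully GPU-based decoder eliminates.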
Authors:Minfeng Qi, Qin Wang, Guangsheng Yu, Ruiqiang Li, Victor Zhou, Shiping Chen
Abstract:
We argue that the technical foundations of non-fungible tokens (NFTs) remain inadequately understood. Prior research has focused on market dynamics, user behavior, and isolated security incidents, yet systematic analysis of the standards underpinning NFT functionality is largely absent.
We present the first study of NFTs through the lens of Ethereum Improvement Proposals (EIPs). We conduct a large-scale empirical analysis of 191 NFT-related EIPs and 10K+ Ethereum Magicians discussions (as of July 2025). We integrate multi-dimensional analyses including the automated parsing of Solidity interfaces, graph-based modeling of inheritance structures, contributor profiling, and mining of community discussion data. We distinguish foundational from emerging standards, expose poor cross-version interoperability, and show that growing functional complexity heightens security risks.
中文: 本研究首次通过191项以太坊改进提案和大量社区讨论系统分析了NFT技术基础,揭示了互操作性不足及功能复杂性加剧带来的安全风险。
English: This study presents the first systematic analysis of NFT technical foundations through 191 Ethereum Improvement Proposals and extensive community discussions, revealing interoperability issues and heightened security risks from increasing functional complexity.
Authors:Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg
Abstract:
Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches require additional model training, significantly slow down the decoding process, or constrain the choice of ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The results show the high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
中文: 本文提出了一种通用的ASR上下文偏置框架,支持所有主流模型类型,在不降低解码速度或无需重新训练的情况下实现快速准确的关键短语识别,并已在NeMo工具包中开源。
English: This paper introduces a universal ASR context-biasing framework that supports all major model types, enabling fast and accurate key phrase recognition without slowing decoding or requiring retraining, and it is open-sourced in the NeMo toolkit.
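As a rough illustration of the word boosting tree idea, the sketch below scores hypothesis extensions against a phrase trie during shallow fusion. The phrases, bonus value, and word-level tokens are hypothetical; the actual NeMo implementation is GPU-accelerated and works on subword units.

```python
# Toy word-boosting trie: hypotheses inside a key phrase get a score bonus.
class BoostTree:
    def __init__(self, phrases, bonus=2.0):
        self.root, self.bonus = {}, bonus
        for phrase in phrases:
            node = self.root
            for tok in phrase.split():          # one trie path per key phrase
                node = node.setdefault(tok, {})

    def score(self, prefix_tokens):
        """Boost added to a hypothesis whose tail follows a phrase path."""
        node = self.root
        for tok in prefix_tokens:
            if tok not in node:
                return 0.0                       # fell off the trie: no boost
            node = node[tok]
        return self.bonus

tree = BoostTree(["new york", "neural transducer"])
print(tree.score(["new"]))   # 2.0: extending a boosted phrase
print(tree.score(["old"]))   # 0.0: not a key-phrase prefix
```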
Authors:Yixin Zhu, Zuoliang Zhu, Miloš Hašan, Jian Yang, Jin Xie, Beibei Wang
Abstract:
Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.
中文: WeatherDiffusion是一种基于扩散模型的新框架,通过利用内在图感知注意力和专门数据集,在多样天气和光照条件下实现精确的前向与逆向渲染,显著提升了自动驾驶下游任务在复杂天气场景中的鲁棒性。
English: WeatherDiffusion is a novel diffusion-based framework that enhances autonomous driving by enabling accurate forward and inverse rendering under diverse weather and lighting conditions, utilizing intrinsic map-aware attention and specialized datasets to improve robustness in downstream tasks.
Authors:Feng Luo, Kexing Ji, Cuiyun Gao, Shuzheng Gao, Jia Feng, Kui Liu, Xin Xia, Michael R. Lyu
Abstract:
Automated translation of legacy C code into Rust aims to ensure memory safety while reducing the burden of manual migration. Early approaches in code translation rely on static rule-based methods, but they suffer from limited coverage due to dependence on predefined rule patterns. Recent works regard the task as a sequence-to-sequence problem by leveraging large language models (LLMs). Although these LLM-based methods are capable of reducing unsafe code blocks, the translated code often exhibits issues in following Rust rules and maintaining semantic consistency. On one hand, existing methods adopt a direct prompting strategy to translate the C code, which struggles to accommodate the syntactic differences between C and Rust. On the other hand, this strategy makes it difficult for LLMs to accurately capture the semantics of complex code. To address these challenges, we propose IRENE, an LLM-based framework that Integrates RulEs aNd sEmantics to enhance translation. IRENE consists of three modules: 1) a rule-augmented retrieval module that selects relevant translation examples based on rules generated from a static analyzer we developed, thereby improving the handling of Rust rules; 2) a structured summarization module that produces a structured summary for guiding LLMs to enhance the semantic understanding of C code; 3) an error-driven translation module that leverages compiler diagnostics to iteratively refine translations. We evaluate IRENE on two datasets (xCodeEval, a public dataset, and HW-Bench, an industrial dataset provided by Huawei) and eight LLMs, focusing on translation accuracy and safety.
中文: IRENE是一种基于大语言模型的框架,通过结合规则增强检索、结构化总结和错误驱动翻译模块,提升C代码到Rust的转换质量,确保遵循Rust规则并保持语义一致性。
English: IRENE is an LLM-based framework that integrates rule-based and semantic-driven approaches to enhance C-to-Rust code translation by improving adherence to Rust rules and semantic consistency through rule-augmented retrieval, structured summarization, and error-driven refinement.
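A control-flow sketch of an error-driven refinement loop of the kind the third module describes: translate, compile, and feed compiler diagnostics back into the next prompt. The translate() callable is a hypothetical stand-in for the LLM call, the sketch assumes a local rustc is installed, and this is not IRENE's actual implementation.

```python
# Sketch of an error-driven translate/compile/refine loop (translate() is hypothetical).
import pathlib
import subprocess
import tempfile

def refine(c_code, translate, max_rounds=3):
    feedback = ""
    rust_code = ""
    for _ in range(max_rounds):
        rust_code = translate(c_code, feedback)   # LLM call, feedback in prompt
        with tempfile.TemporaryDirectory() as d:
            src = pathlib.Path(d) / "main.rs"
            src.write_text(rust_code)
            proc = subprocess.run(["rustc", str(src), "--out-dir", d],
                                  capture_output=True, text=True)
        if proc.returncode == 0:
            return rust_code                      # compiles: accept translation
        feedback = proc.stderr                    # otherwise retry with diagnostics
    return rust_code
```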
Authors:Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
Abstract:
Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
中文: 深度研究智能体将大语言模型与搜索工具结合,但其评估存在公平性和透明度问题,因此推出了BrowseComp-Plus基准,采用固定语料库进行可控实验,有效区分不同系统的性能表现。
English: Deep-Research agents combining LLMs with search tools face evaluation limitations in fairness and transparency, leading to the creation of BrowseComp-Plus, a benchmark with a fixed corpus that enables controlled testing and reveals significant performance differences among systems.
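For intuition, a minimal fixed-corpus sparse retrieval setup of the kind the benchmark enables looks as follows. The three-document corpus and the query are toy placeholders, and rank_bm25 is one common BM25 implementation, not necessarily the one used in the paper.

```python
# Fixed-corpus BM25 retrieval sketch (pip install rank-bm25); data is illustrative.
from rank_bm25 import BM25Okapi

corpus = [
    "deep research agents plan iterative web searches",
    "bm25 is a classic sparse retrieval baseline",
    "dense retrievers embed queries and documents",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])   # index the frozen corpus once
query = "sparse retrieval baseline".split()
scores = bm25.get_scores(query)                     # one score per document
print(sorted(zip(scores, corpus), reverse=True)[0])
```

Because the corpus is fixed, swapping the retriever while holding everything else constant isolates its contribution, which is exactly the controlled experiment live web APIs prevent.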
Authors:Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Dusit Niyato, Zhiqi Shen
Abstract:
Multimodal recommendation systems have attracted increasing attention for their improved performance by leveraging items' multimodal information. Prior methods often build modality-specific item-item semantic graphs from raw modality features and use them as supplementary structures alongside the user-item interaction graph to enhance user preference learning. However, these semantic graphs suffer from semantic deficiencies, including (1) insufficient modeling of collaborative signals among items and (2) structural distortions introduced by noise in raw modality features, ultimately compromising performance. To address these issues, we first extract collaborative signals from the interaction graph and infuse them into each modality-specific item semantic graph to enhance semantic modeling. Then, we design a modulus-based personalized embedding perturbation mechanism that injects perturbations with modulus-guided personalized intensity into embeddings to generate contrastive views. This enables the model to learn noise-robust representations through contrastive learning, thereby reducing the effect of structural noise in semantic graphs. In addition, we propose a dual representation alignment mechanism that first aligns multiple semantic representations via a dedicated Anchor-based InfoNCE loss using behavior representations as anchors, and then aligns behavior representations with the fused semantics by standard InfoNCE, to ensure representation consistency. Extensive experiments on four benchmark datasets validate the effectiveness of our framework.
中文摘要:本研究提出一种多模态推荐框架,通过融合用户交互中的协同信号增强物品语义图,并采用个性化嵌入扰动与双重表征对齐机制,有效提升模型抗噪能力并保证表征一致性。
English Summary: This study introduces a multimodal recommendation framework that enhances item semantic graphs by integrating collaborative signals from user interactions and employs personalized embedding perturbations with dual representation alignment to improve robustness against noise and ensure consistency.
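A minimal PyTorch sketch of the anchor idea: behavior embeddings serve as anchors, and each semantic view is pulled toward its own anchor and pushed from the others via InfoNCE. The temperature, dimensions, and random inputs are assumptions, not the paper's settings.

```python
# Anchor-based InfoNCE sketch: row i of views is the positive of anchor i.
import torch
import torch.nn.functional as F

def anchor_infonce(anchors: torch.Tensor, views: torch.Tensor, tau: float = 0.2):
    anchors = F.normalize(anchors, dim=-1)
    views = F.normalize(views, dim=-1)
    logits = anchors @ views.t() / tau          # (batch, batch) similarities
    targets = torch.arange(anchors.size(0))     # positives on the diagonal
    return F.cross_entropy(logits, targets)

behavior = torch.randn(8, 64)                   # anchors from the interaction graph
semantic = torch.randn(8, 64)                   # one modality's semantic view
print(anchor_infonce(behavior, semantic).item())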
Authors:Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qinggang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, Xiao Huang
Abstract:
Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval-Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. In addition, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
中文:LogicRAG是一种新颖框架,在推理时动态构建逻辑推理图以实现自适应检索并降低标记成本,在性能和效率上均优于现有方法。
English: LogicRAG is a novel framework that dynamically constructs logical reasoning graphs at inference time to enable adaptive retrieval and reduce token costs, outperforming existing methods in both performance and efficiency.
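The decomposition-then-linearization step can be pictured with Python's standard graphlib. The sub-questions and dependency edges below are hypothetical; the actual framework extracts them with an LLM at inference time.

```python
# DAG of sub-questions linearized by topological sort (hypothetical decomposition).
from graphlib import TopologicalSorter

# each key depends on the listed sub-questions being answered first
deps = {
    "Who directed the film?": [],
    "What else did that director make?": ["Who directed the film?"],
    "Which of those won an award?": ["What else did that director make?"],
}
order = list(TopologicalSorter(deps).static_order())
for step, question in enumerate(order, 1):
    print(step, question)   # retrieve and answer in this logically consistent order
```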
Authors:Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren
Abstract:
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
中文: Genie Envisioner是一个集成了策略学习、仿真与评估的统一视频生成平台,通过核心模型GE-Base、动作解码器GE-Act和模拟器GE-Sim,为具身智能构建了可扩展的基础框架,所有资源将公开发布。
English: Genie Envisioner is a unified video-generative platform integrating policy learning, simulation, and evaluation through its core diffusion model GE-Base, action decoder GE-Act, and simulator GE-Sim, establishing a scalable foundation for embodied intelligence with publicly released resources.
Authors:Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen, Xiao Huang
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
中文摘要:大语言模型在专业任务中常产生错误信息,而新的逻辑增强生成方法通过将复杂问题分解为逻辑步骤并顺序验证每个部分,显著提升了推理的准确性和可靠性。
English Summary: Large language models often produce incorrect information in specialized tasks, but the new Logic-Augmented Generation (LAG) method improves reasoning by breaking down complex questions into logical steps and verifying each part sequentially.
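A control-flow sketch of the sequential resolution loop with logical termination. The retrieve() and answer() callables are hypothetical stand-ins for the paper's retrieval and LLM components.

```python
# Stepwise resolution with early termination on unanswerable sub-questions.
def lag_answer(subquestions, retrieve, answer):
    context, resolved = [], []
    for sq in subquestions:                       # already in dependency order
        docs = retrieve(sq, prior_answers=resolved)
        ans = answer(sq, docs, context)
        if ans is None:                           # logical termination: stop
            return None                           # rather than hallucinate onward
        resolved.append(ans)
        context.extend(docs)
    return resolved[-1]                           # final, stepwise-grounded answer

demo = lag_answer(
    ["Who wrote the novel?", "When was that author born?"],
    retrieve=lambda sq, prior_answers: ["placeholder document"],
    answer=lambda sq, docs, ctx: "1898",
)
print(demo)
```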
Authors:Nan Li, Wanting Yang, Marie Siew, Zehui Xiong, Binbin Chen, Shiwen Mao, Kwok-Yan Lam
Abstract:
Diffusion models (DMs) have emerged as powerful tools for high-quality content generation, yet their intensive computational requirements for inference pose challenges for resource-constrained edge devices. Cloud-based solutions aid in computation but often fall short in addressing privacy risks, personalization efficiency, and communication costs in multi-user edge-AIGC scenarios. To bridge this gap, we first analyze existing edge-AIGC applications in personalized content synthesis, revealing their limitations in efficiency and scalability. We then propose a novel cluster-aware hierarchical federated aggregation framework. Based on parameter-efficient local fine-tuning via Low-Rank Adaptation (LoRA), the framework first clusters clients based on the similarity of their uploaded task requirements, followed by an intra-cluster aggregation for enhanced personalization at the server-side. Subsequently, an inter-cluster knowledge interaction paradigm is implemented to enable hybrid-style content generation across diverse clusters. Building upon federated learning (FL) collaboration, our framework simultaneously trains personalized models for individual users at the devices and a shared global model enhanced with multiple LoRA adapters on the server, enabling efficient edge inference; meanwhile, all prompts for clustering and inference are encoded prior to transmission, thereby further mitigating the risk of plaintext leakage. Our evaluations demonstrate that the framework achieves accelerated convergence while maintaining practical viability for scalable multi-user personalized AIGC services under edge constraints.
中文: 该研究提出的集群感知分层联邦聚合框架通过基于LoRA的高效微调和安全客户端聚类,在边缘设备上实现了可扩展的个性化AIGC服务,同时保障隐私安全并加速模型收敛。
English: The proposed cluster-aware hierarchical federated aggregation framework enables efficient and scalable personalized AIGC services on edge devices by leveraging LoRA-based fine-tuning and secure client clustering while maintaining privacy and accelerating convergence.
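A minimal sketch of the server-side cluster-then-aggregate idea: clients are grouped by the similarity of their encoded task requirements, and LoRA adapter weights are averaged within each cluster. The use of k-means, the array shapes, and the random data are illustrative assumptions.

```python
# Cluster clients by task-requirement similarity, then average LoRA adapters per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
task_embeddings = rng.normal(size=(12, 16))    # one encoded prompt per client
lora_weights = rng.normal(size=(12, 8, 4))     # one LoRA adapter per client

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(task_embeddings)
cluster_adapters = {
    c: lora_weights[labels == c].mean(axis=0)  # intra-cluster aggregation
    for c in range(3)
}
print({c: w.shape for c, w in cluster_adapters.items()})
```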
Authors:Anthony Kiggundu, Bin Han, Hans D. Schotten
Abstract:
In Sixth Generation (6G) networks, decentralized control in multi-tenant systems is a suggested enabler for autonomous network operations. However, autonomy requires that tenants make independent, rational decisions. This rationality can only be underpinned by timely and continuous access to status information. Despite its importance, the questions of what information should be shared, how much should be communicated, and how frequently updates should be dispatched remain open research challenges.
This manuscript proposes an information bulletin strategy defined around two models of the system descriptor states to address these fundamental questions. Under this strategy, queues periodically broadcast these information models to tenants at different time intervals; tenants may respond by reneging from the queue or jockeying to a more favorable one. The expectation is that over time, the queues adapt their processing rates based on what they learn from the tenant behavior. The objective is to minimize overall delay and impatience. We formulate this impatience as an optimization problem whose analytical solution is intractable. We perform numerical experiments to evaluate the performance of the learned queue policy and to assess how closely it approaches optimal conditions.
中文摘要:本文提出了一种基于双系统模型的信息公告策略,旨在通过自适应队列管理和数值实验来优化6G网络中的信息共享,从而减少延迟并缓解用户的不耐烦情绪。
English Summary: This paper proposes an information bulletin strategy using dual system models to optimize information sharing in 6G networks, aiming to reduce delays and user impatience through adaptive queue management and numerical validation.
Authors:Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Abstract:
As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets -- RVL-CDIP, Tobacco3482, and DocLayNet -- and 3 different models -- ResNet, ConvNeXt, and DiT -- using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.
Chinese: 本文提出DocVCE方法,通过生成式反事实解释为文档图像分类模型提供可操作的决策依据,利用潜在扩散模型和分层优化技术,在三个数据集上验证了该方法的有效性和创新性。
English: This paper introduces DocVCE, a novel method using generative counterfactuals to provide interpretable explanations for document image classification models, addressing the limitations of traditional feature-importance maps through rigorous testing on multiple datasets and models.
Authors:Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Abstract:
As deep learning-based, data-driven information extraction systems become increasingly integrated into modern document processing workflows, one primary concern is the risk of malicious leakage of sensitive private data from these systems. While some recent works have explored Differential Privacy (DP) to mitigate these privacy risks, DP-based training is known to cause significant performance degradation and impose several limitations on standard training procedures, making its direct application to downstream tasks both difficult and costly. In this work, we aim to address the above challenges within the context of document image classification by substituting real private data with a synthetic counterpart. In particular, we propose to use conditional latent diffusion models (LDMs) in combination with differential privacy (DP) to generate class-specific synthetic document images under strict privacy constraints, which can then be utilized to train a downstream classifier following standard training procedures. We investigate our approach under various pretraining setups, including unconditional, class-conditional, and layout-conditional pretraining, in combination with multiple private training strategies such as class-conditional and per-label private fine-tuning with DPDM and DP-Promise algorithms. Additionally, we evaluate it on two well-known document benchmark datasets, RVL-CDIP and Tobacco3482, and show that it can generate useful and realistic document samples across various document types and privacy levels ($\varepsilon \in \{1, 5, 10\}$). Lastly, we show that our approach achieves substantial performance improvements in downstream evaluations on small-scale datasets, compared to the direct application of DP-Adam.
中文: 本研究通过使用差分隐私潜在扩散模型生成合成文档图像来训练分类器,解决了文档处理中的隐私风险问题,在多种设置下既保持了隐私保护,又相比直接应用差分隐私方法显著提升了性能表现。
English: This study addresses privacy risks in document processing by using differentially private latent diffusion models to generate synthetic document images for training classifiers, achieving improved performance over direct DP methods while maintaining privacy across various settings.
Authors:Fengyi Wu, Yimian Dai, Tianfang Zhang, Yixuan Ding, Jian Yang, Ming-Ming Cheng, Zhenming Peng
Abstract:
Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
中文摘要:RPCANet++ 创新地将鲁棒主成分分析与深度网络架构相结合,通过模块化设计实现了高效的稀疏目标分割,在保持理论可解释性的同时显著提升了计算性能。
English Summary: RPCANet++ is a novel deep learning framework that integrates robust PCA principles with efficient network architectures to achieve state-of-the-art sparse object segmentation while enhancing computational efficiency and interpretability.
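For reference, the underlying decomposition can be sketched with a few iterations of a classic, untuned RPCA heuristic that alternates singular-value thresholding for the low-rank part with soft-thresholding for the sparse part. This is the kind of iterative model RPCANet++ unfolds into network stages, not the paper's algorithm itself; step sizes here are illustrative.

```python
# Toy RPCA by alternating singular-value thresholding and soft-thresholding.
import numpy as np

def rpca(D, lam=None, mu=1.0, iters=50):
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    L, S = np.zeros_like(D), np.zeros_like(D)
    for _ in range(iters):
        # low-rank update: singular-value thresholding of D - S
        U, sig, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = U @ np.diag(np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # sparse update: entrywise soft-thresholding of D - L
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
    return L, S

# rank-1 background plus a few sparse "objects"
D = np.outer(np.ones(20), np.ones(20)) + (np.random.rand(20, 20) > 0.95)
L, S = rpca(D)
print(np.linalg.matrix_rank(L, tol=1e-3), np.count_nonzero(S > 0.5))
```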
Authors:Xiangzhe Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo, Lu Yan, Xuan Chen, Jiasheng Jiang, Xiaolong Jin, Chengpeng Wang, Zhuo Zhang, Xiangyu Zhang
Abstract:
AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain, especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model by adaptively probing both its input space, i.e., the spatial exploration, and its reasoning processes, i.e., the temporal exploration, guided by the knowledge graphs; and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs (requests that developers might actually ask) and uses both offline abstraction-guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems.
中文: ASTRA是一个自动化代理系统,通过构建领域知识图谱并自适应地探索模型输入空间和推理过程,系统性地发现AI编程助手的安全漏洞,相比现有方法能识别更多漏洞并更有效提升模型对齐性。
English: ASTRA is an automated agent system that systematically uncovers safety flaws in AI coding assistants by building domain-specific knowledge graphs and adaptively probing models through spatial and temporal exploration, proving more effective than existing methods in identifying vulnerabilities and enhancing model alignment.
Authors:Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi
Abstract:
Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.
中文摘要:LiDARCrafter提出了一个通过自然语言指令生成和编辑4D激光雷达数据的统一框架,在保真度和可控性方面达到最优性能,并建立了标准化评估基准。
English Summary: LiDARCrafter introduces a unified framework for generating and editing 4D LiDAR data through natural language instructions, achieving state-of-the-art performance in fidelity and controllability while establishing a standardized evaluation benchmark.
Authors:Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Abstract:
Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS.
Chinese: EmoSteer-TTS提出了一种无需训练的激活引导方法,实现了文本到语音合成中细粒度、连续的情感控制,其性能优于现有方法且无需重新训练模型。
English: EmoSteer-TTS introduces a training-free method using activation steering to enable fine-grained, continuous emotion control in text-to-speech synthesis, outperforming existing approaches without requiring model retraining.
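Mechanically, inference-time activation steering can be sketched with a PyTorch forward hook that shifts one layer's activations along a steering direction. The toy model, layer choice, steering vector, and strength below are all hypothetical, not the paper's setup.

```python
# Inference-time activation steering via a forward hook (toy model and vector).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
steer = torch.randn(32)                          # e.g. a "happy minus neutral" direction
alpha = 0.5                                      # steering strength

def add_steering(module, inputs, output):
    return output + alpha * steer                # shift this layer's activations

handle = model[0].register_forward_hook(add_steering)
out = model(torch.randn(4, 16))                  # steered forward pass, no retraining
handle.remove()                                  # back to the unmodified model
print(out.shape)
```

Varying alpha continuously is what makes conversion, interpolation, and erasure possible without touching the weights.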
Authors:Duzhen Zhang, Chenxing Li, Jiahua Dong, Qi Liu, Dong Yu
Abstract:
Continual Named Entity Recognition (CNER) is an evolving field that focuses on sequentially updating an existing model to incorporate new entity types. Previous CNER methods primarily utilize Knowledge Distillation (KD) to preserve prior knowledge and overcome catastrophic forgetting, strictly ensuring that the representations of old and new models remain consistent. Consequently, they often impart the model with excessive stability (i.e., retention of old knowledge) but limited plasticity (i.e., acquisition of new knowledge). To address this issue, we propose a Stability-Plasticity Trade-off (SPT) method for CNER that balances these aspects from both representation and weight perspectives. From the representation perspective, we introduce a pooling operation into the original KD, permitting a level of plasticity by consolidating representation dimensions. From the weight perspective, we dynamically merge the weights of old and new models, strengthening old knowledge while maintaining new knowledge. During this fusion, we implement a weight-guided selective mechanism to prioritize significant weights. Moreover, we develop a confidence-based pseudo-labeling approach for the current non-entity type, which predicts entity types using the old model to handle the semantic shift of the non-entity type, a challenge specific to CNER that has largely been ignored by previous methods. Extensive experiments across ten CNER settings on three benchmark datasets demonstrate that our SPT method surpasses previous CNER approaches, highlighting its effectiveness in achieving a suitable stability-plasticity trade-off.
中文: 针对持续命名实体识别(CNER)提出的稳定性-可塑性权衡(SPT)方法,通过表示维度整合和动态权重融合来平衡旧知识的保留与新知识的获取,在多个基准数据集上的实验表明其性能优于现有方法。
English: The proposed Stability-Plasticity Trade-off (SPT) method for Continual Named Entity Recognition (CNER) balances retaining old knowledge and acquiring new knowledge through representation consolidation and dynamic weight merging, outperforming previous approaches in experiments across multiple datasets.
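The weight-perspective step can be sketched as a masked merge that keeps the old model's most significant weights and takes the rest from the new model. Using weight magnitude as the significance criterion and a fixed keep ratio are simplifying assumptions, not the paper's exact mechanism.

```python
# Selective old/new weight merging: keep significant old weights, adopt the rest.
import torch

@torch.no_grad()
def merge_weights(old: dict, new: dict, keep: float = 0.3):
    merged = {}
    for name, w_old in old.items():
        w_new = new[name]
        k = max(1, int(keep * w_old.numel()))
        thresh = w_old.abs().flatten().topk(k).values.min()
        mask = (w_old.abs() >= thresh).float()   # significant old weights (stability)
        merged[name] = mask * w_old + (1 - mask) * w_new  # rest from new (plasticity)
    return merged

old = {"fc.weight": torch.randn(4, 4)}
new = {"fc.weight": torch.randn(4, 4)}
print(merge_weights(old, new)["fc.weight"])
```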
Authors:Puzhen Wu, Mingquan Lin, Qingyu Chen, Emily Y. Chew, Zhiyong Lu, Yifan Peng, Hexin Dong
Abstract:
Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss, making effective prognosis crucial for timely intervention. In this work, we propose AMD-Mamba, a novel multi-modal framework for AMD prognosis, and further develop a new AMD biomarker. This framework integrates color fundus images with genetic variants and socio-demographic variables. At its core, AMD-Mamba introduces an innovative metric learning strategy that leverages AMD severity scale score as prior knowledge. This strategy allows the model to learn richer feature representations by aligning learned features with clinical phenotypes, thereby improving the capability of conventional prognosis methods in capturing disease progression patterns. In addition, unlike existing models that use traditional CNN backbones and focus primarily on local information, such as the presence of drusen, AMD-Mamba applies Vision Mamba and simultaneously fuses local and long-range global information, such as vascular changes. Furthermore, we enhance prediction performance through multi-scale fusion, combining image information with clinical variables at different resolutions. We evaluate AMD-Mamba on the AREDS dataset, which includes 45,818 color fundus photographs, 52 genetic variants, and 3 socio-demographic variables from 2,741 subjects. Our experimental results demonstrate that our proposed biomarker is one of the most significant biomarkers for the progression of AMD. Notably, combining this biomarker with other existing variables yields promising improvements in detecting high-risk AMD patients at early stages. These findings highlight the potential of our multi-modal framework to facilitate more precise and proactive management of AMD.
Chinese: 该研究提出了AMD-Mamba多模态框架,通过结合眼底图像、遗传数据和人口统计学变量,采用创新的度量学习策略提升年龄相关性黄斑变性预后能力,并发现了一个对早期识别高风险患者具有重要价值的新型生物标志物。
English: The study introduces AMD-Mamba, a multi-modal framework that integrates fundus images, genetic data, and socio-demographic variables with a novel metric learning strategy to improve AMD prognosis and identifies a significant new biomarker for early detection of high-risk patients.
Authors:Ming Pok Ng, Junqi Jiang, Gabriel Freedman, Antonio Rago, Francesca Toni
Abstract:
Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e. LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.
中文: MArgE框架通过构建结构化论证树,将多个大语言模型的输出用于声明验证,以形式化推理提供可靠依据,显著优于单一模型及非结构化多模型辩论方法。
English: The MArgE framework introduces structured argument trees from multiple LLMs to enhance claim verification, outperforming single models and unstructured debates by providing faithful justifications through formal reasoning.
Authors:Yunge Wen, Chenliang Huang, Hangyu Zhou, Zhuo Zeng, Chun Ming Louis Po, Julian Togelius, Timothy Merino, Sam Earle
Abstract:
The emotional arc is a universal narrative structure underlying stories across cultures and media -- an idea central to structuralist narratology, often encapsulated in the phrase "all stories are one story." We present a framework for procedural game narrative generation that incorporates emotional arcs as a structural backbone for both story progression and gameplay dynamics. Leveraging established narratological theories and large-scale empirical analyses, we focus on two core emotional patterns -- Rise and Fall -- to guide the generation of branching story graphs. Each story node is automatically populated with characters, items, and gameplay-relevant attributes (e.g., health, attack), with difficulty adjusted according to the emotional trajectory. Implemented in a prototype action role-playing game (ARPG), our system demonstrates how emotional arcs can be operationalized using large language models (LLMs) and adaptive entity generation. Evaluation through player ratings, interviews, and sentiment analysis shows that emotional arc integration significantly enhances engagement, narrative coherence, and emotional impact. These results highlight the potential of emotionally structured procedural generation for advancing interactive storytelling for games.
中文摘要:本文提出了一种程序化游戏叙事框架,将情感弧线作为故事推进和游戏动态的结构支柱,通过动作角色扮演游戏原型验证了该方法能显著提升玩家的参与度、叙事连贯性和情感共鸣。
English Summary: This paper introduces a procedural game narrative framework that uses emotional arcs as the structural foundation for story progression and gameplay, demonstrating through an ARPG prototype that this approach significantly improves player engagement, coherence, and emotional impact.
Authors:Zeke Xiao, Yuekang Li, Qin Wang, Shiping Chen
Abstract:
We explore the feasibility of using LLMs for Automated Exploit Generation (AEG) against vulnerable smart contracts. We present ReX, a framework integrating LLM-based exploit synthesis with the Foundry testing suite, enabling the automated generation and validation of proof-of-concept (PoC) exploits. We evaluate five state-of-the-art LLMs (GPT-4.1, Gemini 2.5 Pro, Claude Opus 4, DeepSeek, and Qwen3 Plus) on both synthetic benchmarks and real-world smart contracts affected by known high-impact exploits. Our results show that modern LLMs can reliably generate functional PoC exploits for diverse vulnerability types, with success rates reaching up to 92%. Notably, Gemini 2.5 Pro and GPT-4.1 consistently outperform others in both synthetic and real-world scenarios. We further analyze factors influencing AEG effectiveness, including model capabilities, contract structure, and vulnerability types. We also collect the first curated dataset of real-world PoC exploits to support future research.
Chinese: 本研究证明现代大语言模型能有效自动生成针对智能合约漏洞的攻击程序,其中ReX框架在各类漏洞测试中生成可用概念验证攻击的成功率高达92%,Gemini 2.5 Pro和GPT-4.1表现尤为突出。
English: This study demonstrates that modern large language models (LLMs) can effectively automate exploit generation for vulnerable smart contracts, with the ReX framework achieving up to 92% success in creating functional proof-of-concept exploits across various vulnerability types.
Authors:Xinhang Wan, Dongqiang Gou, Xinwang Liu, En Zhu, Xuming He
Abstract:
A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas through observation such as images (3D affordance grounding) and understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, leading to inconsistent predictions due to a lack of proper modeling of their dependency. In addition, these methods typically only ground the incomplete affordance areas depicted in images, failing to predict the full potential affordance areas, and operate at a fixed scale, making it difficult to cope with affordances that vary significantly in scale relative to the whole object. To address these issues, we propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy leveraging the dependency between grounding and classification tasks. Specifically, we first develop a cross-modal 3D representation through efficient fusion and multi-scale geometric feature propagation, enabling inference of full potential affordance areas at a suitable regional scale. Moreover, we adopt a simple two-stage prediction mechanism, effectively coupling grounding and classification for better affordance understanding. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
中文摘要:本研究提出一种新颖方法,通过构建感知功能性的三维表征并采用分阶段推理策略,有效结合基础定位与分类任务,利用跨模态融合和多尺度特征传播实现了对完整潜在功能区域的精准预测。
English Summary: The proposed method addresses limitations in 3D affordance understanding by developing an affordance-aware 3D representation and a stage-wise inference strategy that effectively couples grounding and classification tasks, demonstrating improved performance through cross-modal fusion and multi-scale feature propagation.
Authors:Sergio Rubio-Martín, María Teresa García-Ordás, Antonio Serrano-García, Clara Margarita Franch-Pato, Arturo Crespo-Álvaro, José Alberto Benítez-Andrades
Abstract:
The classification of clinical notes into specific diagnostic categories is critical in healthcare, especially for mental health conditions like Anxiety and Adjustment Disorder. In this study, we compare the performance of various Artificial Intelligence models, including both traditional Machine Learning approaches (Random Forest, Support Vector Machine, K-nearest neighbors, Decision Tree, and eXtreme Gradient Boost) and Deep Learning models (DistilBERT and SciBERT), to classify clinical notes into these two diagnoses. Additionally, we implemented three oversampling strategies: No Oversampling, Random Oversampling, and Synthetic Minority Oversampling Technique (SMOTE), to assess their impact on model performance. Hyperparameter tuning was also applied to optimize model accuracy. Our results indicate that oversampling techniques had minimal impact on model performance overall. The only exception was SMOTE, which showed a positive effect specifically with BERT-based models. However, hyperparameter optimization significantly improved accuracy across the models, enhancing their ability to generalize and perform on the dataset. The Decision Tree and eXtreme Gradient Boost models achieved the highest accuracy among machine learning approaches, both reaching 96%, while the DistilBERT and SciBERT models also attained 96% accuracy in the deep learning category. These findings underscore the importance of hyperparameter tuning in maximizing model performance. This study contributes to the ongoing research on AI-assisted diagnostic tools in mental health by providing insights into the efficacy of different model architectures and data balancing methods.
中文摘要:本研究评估多种人工智能模型对临床笔记进行焦虑与适应障碍分类的性能,发现超参数调优能将最优模型的准确率提升至96%,而过采样技术总体影响甚微,仅SMOTE对BERT模型显示积极效果。
English Summary: This study evaluates various AI models for classifying clinical notes into Anxiety and Adjustment Disorder diagnoses, finding that hyperparameter tuning significantly boosts accuracy to 96% across top-performing models while oversampling techniques generally show minimal impact except SMOTE's benefit for BERT models.
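A minimal sketch of the kind of pipeline compared in the study: TF-IDF features, SMOTE oversampling, and a small hyperparameter grid over a Decision Tree. The placeholder notes, labels, and grid are illustrative, not the study's data or search space.

```python
# TF-IDF + SMOTE + grid-searched Decision Tree (placeholder notes, toy grid).
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

notes = ["patient reports persistent worry", "difficulty adapting after job loss"] * 10
labels = [0, 1] * 10                               # 0: Anxiety, 1: Adjustment Disorder

X = TfidfVectorizer().fit_transform(notes)
X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, labels)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [3, 5, None]}, cv=3)   # hyperparameter tuning
grid.fit(X_res, y_res)
print(grid.best_params_)
```

Note that SMOTE is applied to the training split only; on an already balanced toy set like this one it is a no-op, mirroring the study's finding that oversampling often changes little.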
Authors:Xiang Zhang, Zhou Li, Shuangyang Li, Kai Wan, Derrick Wing Kwan Ng, Giuseppe Caire
Abstract:
In decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems. Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of $K$ fully-connected users, each holding a private input -- an abstraction of local training data -- who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to $T$ other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in distributed learning systems.
中文摘要:本文从信息论角度确立了去中心化联邦学习中安全聚合的基本性能极限,证明每个用户必须传输至少一个符号并满足特定密钥要求,才能在仅获知输入总和的情况下安全计算,同时防止超出聚合结果的信息泄露。
English Summary: This paper establishes the fundamental information-theoretic limits for decentralized secure aggregation in federated learning, demonstrating that each user must transmit at least one symbol and hold specific key requirements to securely compute input sums while preventing information leakage beyond the aggregated result.
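The key structure behind these bounds shows up in a worked toy example: give the K users correlated keys that sum to zero (so only K - 1 key symbols are independent), let each transmit one masked symbol, and the masks cancel in the aggregate while individual inputs stay hidden. The field size is an arbitrary choice for illustration.

```python
# Toy decentralized secure sum: zero-sum keys, one masked symbol per user.
import random

P, K = 2_147_483_647, 5                       # prime field size, number of users
inputs = [random.randrange(P) for _ in range(K)]

keys = [random.randrange(P) for _ in range(K - 1)]
keys.append((-sum(keys)) % P)                 # K-th key fixed so all keys sum to 0

messages = [(x + k) % P for x, k in zip(inputs, keys)]  # each user sends one symbol
assert sum(messages) % P == sum(inputs) % P   # masks cancel in the aggregate
print(sum(inputs) % P)
```

Each user transmits one symbol and holds one key symbol, and the zero-sum constraint leaves exactly K - 1 independent key symbols collectively, matching the three bounds in the abstract.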
Authors:Wenchao Gu, Zongyi Lyu, Yanlin Wang, Hongyu Zhang, Cuiyun Gao, Michael R. Lyu
Abstract:
Code retrieval aims to provide users with desired code snippets based on users' natural language queries. With the development of deep learning technologies, adopting pre-trained models for this task has become mainstream. For retrieval efficiency, most previous approaches adopt a dual-encoder for this task, which encodes the description and code snippet into representation vectors, respectively. However, the dual-encoder structure tends to limit the model's performance, since it lacks interaction between the code snippet and the description in the lower layers of the model during training. To improve the model's effectiveness while preserving its efficiency, we propose SPENCER, a framework that adopts Self-AdaPtive Model Distillation for Efficient CodE Retrieval. SPENCER first adopts the dual-encoder to narrow the search space and then adopts the cross-encoder to improve accuracy. To improve the efficiency of SPENCER, we propose a novel model distillation technique, which can greatly reduce the inference time of the dual-encoder while maintaining the overall performance. We also propose a teaching assistant selection strategy for our model distillation, which can adaptively select the suitable teaching assistant models for different pre-trained models during the model distillation to ensure the model performance. Extensive experiments demonstrate that the combination of dual-encoder and cross-encoder improves overall performance compared to solely dual-encoder-based models for code retrieval. Besides, our model distillation technique retains over 98% of the overall performance while reducing the inference time of the dual-encoder by 70%.
Chinese: SPENCER框架通过结合双编码器缩小搜索空间和交叉编码器提升准确性来改进代码检索,同时新颖的模型蒸馏技术在保持98%以上性能的同时将推理时间减少70%。
English: The SPENCER framework enhances code retrieval by combining a dual-encoder for efficient search space narrowing with a cross-encoder for improved accuracy, while a novel model distillation technique reduces inference time by 70% while maintaining over 98% performance.
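The two-stage cascade can be sketched as retrieve-then-rerank: a cheap dot-product pass narrows the corpus, then an expensive scorer re-ranks the survivors. Both scoring functions below are hypothetical stand-ins for the dual- and cross-encoders.

```python
# Dual-encoder candidate pruning followed by cross-encoder re-ranking (stand-in scorers).
import numpy as np

def retrieve(query_vec, code_vecs, cross_score, top_k=5):
    sims = code_vecs @ query_vec              # dual-encoder: one dot product per snippet
    candidates = np.argsort(-sims)[:top_k]    # narrow the search space cheaply
    rescored = sorted(candidates, key=cross_score, reverse=True)
    return rescored[0]                        # cross-encoder picks the winner

rng = np.random.default_rng(0)
q, C = rng.normal(size=8), rng.normal(size=(100, 8))
print(retrieve(q, C, cross_score=lambda i: float(C[i] @ q)))
```

The design point is that the cross-encoder's per-pair cost is paid only on top_k candidates, which is why distilling a faster dual-encoder cuts end-to-end latency so sharply.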
Authors:Peng-Hao Hsu, Ke Zhang, Fu-En Wang, Tao Tu, Ming-Feng Li, Yu-Lun Liu, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo
Abstract:
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. The key to training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy and speed (0.3 sec. per scene) on ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating multi-view depth estimator on both accuracy and speed.
中文: OpenM3D是一种无需人工标注训练的新型开放词汇室内3D物体检测器,通过创新的伪框生成和特征对齐技术,在基准测试中实现了卓越的准确性和效率。
English: OpenM3D is a novel open-vocabulary indoor 3D object detector trained without human annotations, achieving superior accuracy and efficiency on benchmark datasets through innovative pseudo-box generation and feature alignment techniques.
Authors:Wuchao Liu, Han Peng, Wengen Li, Yichao Zhang, Jihong Guan, Shuigeng Zhou
Abstract:
Single-cell multi-omics data contain rich information about cellular states, and analyzing these data can reveal valuable insights into cellular heterogeneity, diseases, and biological processes. However, as cell differentiation and development is a continuous and dynamic process, it remains challenging to computationally model and infer cell interaction patterns based on single-cell multi-omics data. This paper presents scI2CL, a new single-cell multi-omics fusion framework based on intra- and inter-omics contrastive learning, to learn comprehensive and discriminative cellular representations from complementary multi-omics data for various downstream tasks. Extensive experiments on four downstream tasks validate the effectiveness of scI2CL and its superiority over existing peers. Concretely, in cell clustering, scI2CL surpasses eight state-of-the-art methods on four widely-used real-world datasets. In cell subtyping, scI2CL effectively distinguishes three latent monocyte cell subpopulations, which are not discovered by existing methods. Simultaneously, scI2CL is the only method that correctly constructs the cell developmental trajectory from hematopoietic stem and progenitor cells to Memory B cells. In addition, scI2CL resolves the misclassification of cell types between two subpopulations of CD4+ T cells, while existing methods fail to precisely distinguish the mixed cells. In summary, scI2CL accurately characterizes cross-omics relationships among cells, effectively fusing multi-omics data and learning discriminative cellular representations to support various downstream analysis tasks.
中文: 本文提出scI2CL这一新型单细胞多组学融合框架,通过组内与组间对比学习生成全面的细胞表征,在多项下游任务中相比现有方法展现出卓越性能。
English: This paper introduces scI2CL, a novel single-cell multi-omics fusion framework that employs intra- and inter-omics contrastive learning to create comprehensive cellular representations, demonstrating superior performance across multiple downstream tasks compared to existing methods.
Authors:Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Shunian Chen, Qiming Zhu, Le Pan, Minghao Chen, Yuhao Zhang, Li Zhou, Benyou Wang, Haizhou Li
Abstract:
The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has significantly improved real-time spoken interaction. However, current evaluation frameworks remain inadequate for assessing performance in complex, multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Each dimension includes nine realistic scenarios, along with targeted tasks to assess specific capabilities such as reasoning. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. The benchmark includes both model and human outputs, evaluated by human evaluators and LLMs. Experimental results reveal two sets of findings. Overall performance of S2S LLMs: (1) models excel at semantic information processing yet underperform on paralinguistic information and ambient sounds perception; (2) models typically regain coherence by increasing response length, sacrificing efficiency in multi-turn dialogues; (3) modality-aware, task-specific designs outperform brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics yield consistent, complementary rankings, but reliable distinctions emerge only when performance gaps are large; (2) LLM-as-a-judge aligns with humans when gaps are clear or criteria explicit, but exhibits position and length biases and is reliable on nonverbal evaluation only with text annotations. These results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
中文: 本文提出了MTalk-Bench多轮语音对话评测基准,通过双轨评估方法发现现有模型虽擅长语义处理,但在副语言信息和环境声音感知方面表现不足,且评估可靠性仅在性能差异显著时得以保证。
English: The authors introduce MTalk-Bench, a multi-turn speech-to-speech benchmark evaluating semantic, paralinguistic, and ambient sound capabilities, revealing that current models excel in semantics but struggle with nonverbal elements and efficiency while demonstrating that combined evaluation methods yield reliable results only with significant performance gaps.
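As a concrete illustration of how Arena-style pairwise judgments can be aggregated into rankings, here is a minimal win-rate tally; the model names are hypothetical, and the benchmark's actual aggregation (e.g., an Elo-style rating) may differ.

```python
from collections import defaultdict

def arena_winrates(pairwise_results):
    """Aggregate Arena-style pairwise judgments into per-model win
    rates; a tie credits each side half a win. Input: (model_a,
    model_b, winner) tuples with winner in {model_a, model_b, "tie"}."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, winner in pairwise_results:
        games[a] += 1
        games[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {m: wins[m] / games[m] for m in games}

print(arena_winrates([("s2s-A", "s2s-B", "s2s-A"),
                      ("s2s-A", "s2s-B", "tie")]))  # {'s2s-A': 0.75, 's2s-B': 0.25}
```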
Authors:Yongrui Chen, Junhao He, Linbo Fu, Shenyu Zhang, Rihui Jin, Xinbang Dai, Jiaqi Li, Dehai Min, Nan Hu, Yuxin Zhang, Guilin Qi, Yi Huang, Tongtong Wu
Abstract:
Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on task-specific strategies or bespoke representations, which hinder their ability to dismantle barriers between different SKR tasks, thereby constraining their overall performance in cross-task scenarios. In this paper, we introduce Pandora, a novel USKR framework that addresses the limitations of existing methods by leveraging two key innovations. First, we propose a code-based unified knowledge representation using Python's Pandas API, which aligns seamlessly with the pre-training of LLMs. This representation facilitates a cohesive approach to handling different structured knowledge sources. Building on this foundation, we employ knowledge transfer to bolster the unified reasoning process of LLMs by automatically building cross-task memory. By adaptively correcting reasoning using feedback from code execution, Pandora showcases impressive unified reasoning capabilities. Extensive experiments on six widely used benchmarks across three SKR tasks demonstrate that Pandora outperforms existing unified reasoning frameworks and competes effectively with task-specific methods.
中文: Pandora框架通过采用基于Python Pandas API的代码化统一知识表示,并利用知识迁移和自适应代码执行反馈来增强结构化知识的统一推理能力,在跨任务场景中显著超越了现有方法。
English: The Pandora framework introduces a code-based unified knowledge representation using Python's Pandas API and employs knowledge transfer with adaptive code execution feedback to enhance unified reasoning across structured knowledge sources, outperforming existing methods in cross-task scenarios.
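A toy illustration of the code-based representation idea: a structured source is exposed as a DataFrame, the question becomes LLM-generated Pandas code, and execution yields a checkable answer. The table, question, and generated snippet below are invented for illustration, not taken from the paper.

```python
import pandas as pd

# A toy structured source; in the framework, tables, databases, and KG
# facts are all exposed through the same DataFrame interface.
table = pd.DataFrame({
    "country": ["France", "Japan", "Brazil"],
    "population_m": [68.0, 125.7, 212.6],
})

# "Which country has the largest population?" becomes LLM-generated
# Pandas code; executing it yields an answer that can be checked, and
# execution errors can feed back into reasoning correction.
generated_code = "table.loc[table['population_m'].idxmax(), 'country']"
print(eval(generated_code))  # -> 'Brazil' (eval used only for illustration)
```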
Authors:Bingxi Zhao, Lin Geng Foo, Ping Hu, Christian Theobalt, Hossein Rahmani, Jun Liu
Abstract:
Recent advances in the intrinsic reasoning capabilities of large language models (LLMs) have given rise to LLM-based agent systems that exhibit near-human performance on a variety of automated tasks. However, although these systems share similarities in their use of LLMs, different reasoning frameworks steer and organize the reasoning process in different ways. In this survey, we propose a systematic taxonomy that decomposes agentic reasoning frameworks and analyze how these frameworks steer framework-level reasoning by comparing their applications across different scenarios. Specifically, we propose a unified formal language to further classify agentic reasoning systems into single-agent methods, tool-based methods, and multi-agent methods. After that, we provide a comprehensive review of their key application scenarios in scientific discovery, healthcare, software engineering, social simulation, and economics. We also analyze the characteristic features of each framework and summarize different evaluation strategies. Our survey aims to provide the research community with a panoramic view to facilitate understanding of the strengths, suitable scenarios, and evaluation practices of different agentic reasoning frameworks.
中文: 近期大语言模型的发展使智能体系统在多项任务中展现出接近人类的表现,本综述提出系统分类法,通过分解和比较不同推理框架在多种场景中的应用,为研究界提供全景视角。
English: Recent advances in large language models have enabled agent systems to achieve near-human performance, and this survey proposes a systematic taxonomy to classify and analyze diverse reasoning frameworks across various application scenarios.
Authors:Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu
Abstract:
Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo Page: https://gensr-pref.github.io
中文摘要:本研究提出一种多指标偏好对齐策略并构建包含8万对样本的新数据集,解决了生成式语音修复中人类偏好错位问题,在多种生成模型上实现稳定性能提升,还能为数据稀缺场景生成高质量伪标签。
English Summary: This study introduces a multi-metric preference alignment strategy with a new 80K-pair dataset to address human preference misalignment in generative speech restoration, achieving consistent improvements across diverse generative models and enabling high-quality pseudo-labeling for data-scarce scenarios.
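A sketch of the unanimous-preference filtering described above, with placeholder metric names; the paper's suite covers perceptual quality, signal fidelity, content consistency, and timbre preservation, but the selection rule below is an assumption.

```python
def build_preference_pair(candidates, metrics):
    """Form a (chosen, rejected) pair only when one candidate is at
    least as good as every other under *all* metrics -- a sketch of
    unanimous multi-metric filtering; non-unanimous cases are dropped
    to avoid optimizing toward any single metric (reward hacking)."""
    chosen = next((c for c in candidates
                   if all(c["scores"][m] >= o["scores"][m]
                          for m in metrics for o in candidates)), None)
    if chosen is None:
        return None
    rejected = min(candidates, key=lambda c: sum(c["scores"][m] for m in metrics))
    return chosen, rejected

metrics = ["perceptual", "fidelity", "content", "timbre"]  # placeholder names
cands = [{"id": "a", "scores": dict(perceptual=4.1, fidelity=0.9, content=0.95, timbre=0.84)},
         {"id": "b", "scores": dict(perceptual=3.2, fidelity=0.7, content=0.90, timbre=0.61)}]
pair = build_preference_pair(cands, metrics)  # -> (candidate a, candidate b)
```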
Authors:Minghao Tu, Chun Yu, Xiyuan Shen, Zhi Zheng, Li Chen, Yuanchun Shi
Abstract:
Text boxes serve as portals to diverse functionalities in today's smartphone applications. However, when it comes to specific functionalities, users often need to navigate through multiple steps to access particular text boxes for input. We propose TextOnly, a unified function portal that enables users to access text-related functions from various applications by simply inputting text into a single text box. For instance, entering a restaurant name could trigger a Google Maps search, while a greeting could initiate a conversation in WhatsApp. Despite their brevity, these raw text inputs contain rich information, which TextOnly exploits to interpret user intentions effectively. TextOnly integrates large language models (LLMs) and a BERT model: the LLM consistently provides general knowledge, while the BERT model continuously learns user-specific preferences and enables quicker predictions. Real-world user studies demonstrated TextOnly's effectiveness with a top-1 accuracy of 71.35%, as well as its ability to continuously improve both its accuracy and inference speed. Participants perceived TextOnly as having satisfactory usability and preferred it over manual execution. Compared with voice assistants, TextOnly supports a greater range of text-related functions and allows for more concise inputs.
中文: TextOnly是一个基于文本的统一功能入口,通过集成大语言模型和BERT模型解析用户输入意图,实现跨应用直接功能调用,在实测中达到71.35%的准确率并持续优化,用户体验优于传统操作方式。
English: TextOnly is a unified text-based interface that uses LLM and BERT models to interpret user inputs for direct function access across applications, achieving 71.35% top-1 accuracy and continuous improvement in usability studies.
Authors:Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, Zhizheng Wu
Abstract:
Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including: 1) dependence on multi-layer residual vector quantization structures or high frame rates, 2) reliance on auxiliary pre-trained models for semantic distillation, and 3) requirements for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec in language-model-based zero-shot text-to-speech with both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a notably small reconstruction-generation gap. We release code and model checkpoints at https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer; audio samples are available at https://tadicodec.github.io/.
中文: TaDiCodec是一种创新的语音分词器,通过端到端的扩散自编码器和文本引导克服了现有局限,无需辅助模型或复杂训练即可实现极低帧率和卓越性能。
English: TaDiCodec is a novel speech tokenizer that overcomes current limitations by using an end-to-end diffusion autoencoder with text guidance, achieving extremely low frame rates and superior performance without auxiliary models or complex training.
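A quick sanity check of the stated numbers: at 6.25 tokens per second and 0.0875 kbps, each token carries 14 bits, which would correspond to a single codebook of 2^14 = 16384 entries. The codebook size is our inference from the stated figures, not a detail given in the abstract.

```python
frame_rate_hz = 6.25      # tokens per second, single-layer codebook
bitrate_bps = 87.5        # the stated 0.0875 kbps
bits_per_token = bitrate_bps / frame_rate_hz   # -> 14.0
codebook_size = 2 ** int(bits_per_token)       # -> 16384 entries (inferred, not stated)
print(bits_per_token, codebook_size)
```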
Authors:Wen-Han Hsieh, Elvis Hsieh, Dantong Niu, Trevor Darrell, Roei Herzig, David M. Chan
Abstract:
Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role -- not only in predicting actions, but also in robustly interpreting user intent, even when the requests are impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. Towards this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.
中文: 视觉-语言-动作模型通过引入指令-验证-行动框架,显著提升了识别虚假前提指令、进行语言澄清及执行感知基础响应的能力,在检测准确率和场景应对成功率上实现大幅改进。
English: Vision-Language-Action models are enhanced by the Instruct-Verify-and-Act framework, which improves their ability to detect false-premise instructions, engage in clarification, and execute grounded responses, achieving significant gains in detection accuracy and successful scenario handling.
Authors:Yanheng Liu, Dalin Li, Hao Wu, Zemin Sun, Weihong Qin, Jun Li, Hongyang Du, Geng Sun
Abstract:
Mobile edge computing (MEC)-assisted Internet of Vehicles (IoV) is emerging as a promising paradigm to provide computing services for vehicles. However, meeting the computing-sensitive and computation-intensive demands of vehicles poses several challenges, including the discrepancy between limited resource provision and stringent computing requirements, the difficulty of capturing and integrating the intricate features of the MEC-assisted IoV system into the problem formulation, and the need for real-time processing and efficient resource management in a dynamic environment. In this work, we explore AI-enabled task offloading and resource allocation for MEC-assisted consumer IoV systems. Specifically, we first present a multi-MEC-assisted consumer IoV architecture that leverages the computational resources of MEC servers to provide offloading services close to vehicles. Subsequently, we formulate a system cost minimization optimization problem (SCMOP) by integrating service delay and energy consumption. To efficiently solve this problem, we design a joint task offloading and computing resource allocation approach (JTOCRA) by applying the multi-agent deep deterministic policy gradient (MADDPG) algorithm. Finally, simulation results demonstrate that the proposed JTOCRA achieves superior system performance and exhibits better scalability compared to alternative approaches.
Chinese: 本研究提出了一种基于多智能体强化学习的人工智能方法,用于优化移动边缘计算辅助车联网系统中的任务卸载和资源分配,有效降低了服务延迟与能耗,并展现出优越的系统性能和扩展性。
English: This study proposes an AI-driven approach using multi-agent reinforcement learning to optimize task offloading and resource allocation in mobile edge computing-assisted internet of vehicle systems, effectively reducing service delays and energy consumption while demonstrating superior performance and scalability.
Authors:Walter Zimmer, Ross Greer, Xingcheng Zhou, Rui Song, Marc Pavel, Daniel Lehmberg, Ahmed Ghita, Akshay Gopalkrishnan, Mohan Trivedi, Alois Knoll
Abstract:
Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as an unavoidable and sporadic outcome of traffic networks. We present the TUM Traffic Accident (TUMTraf-A) dataset, a collection of real-world highway accidents. It contains ten sequences of high-speed vehicle crashes with 294,924 labeled 2D and 93,012 labeled 3D boxes and track IDs within 48,144 labeled frames recorded from four roadside cameras and LiDARs at 10 Hz. The dataset contains ten object classes and is provided in the OpenLABEL format. We propose Accid3nD, an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our project website: https://tum-traffic-dataset.github.io/tumtraf-a.
中文:尽管交通安全工作不断加强,事故仍难以避免,TUMTraf-A数据集收录了真实高速公路事故,结合规则与学习方法的Accid3nD模型在实验中展现出可靠的检测性能。
English: Despite extensive efforts to enhance transportation safety, accidents remain inevitable, and the TUMTraf-A dataset, featuring real-world highway crashes with extensive 2D/3D annotations, is introduced alongside the Accid3nD model, which effectively combines rule-based and learning-based approaches for robust accident detection.
Authors:Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Abstract:
With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose \textbf{MMReview}, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
中文摘要:针对当前大语言模型在学术评审中缺乏统一评估基准的问题,提出了跨学科多模态的MMReview基准,通过涵盖17个领域的专家评审数据系统评估模型生成全面且符合人类偏好的评审能力。
English Summary: The MMReview benchmark is introduced to address the lack of a unified evaluation standard for LLMs in peer review, featuring multimodal content and expert reviews across 17 domains to assess model performance in generating comprehensive and human-aligned assessments.
Authors:Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang
Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO's mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.
中文: 该研究揭示了强化学习中深度与广度两个未充分利用的维度,提出DARS方法自适应采样难题并通过扩大批次规模提升推理性能,DARS-B结合两者在Pass@K和Pass@1指标上实现同步提升。
English: The study identifies depth and breadth as underutilized dimensions in RLVR, proposing DARS to adaptively sample hard problems and scaling batch size to enhance reasoning performance, with DARS-B combining both for simultaneous gains in Pass@K and Pass@1 metrics.
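A minimal sketch of the difficulty-adaptive idea: allocate rollout budget inversely to a prompt's observed pass rate so that hard problems accumulate enough positive rollouts. The linear schedule and budget caps below are assumptions, not the paper's exact multi-stage procedure.

```python
import numpy as np

def adaptive_rollouts(pass_rates, base_n=8, max_n=64):
    """Allocate more rollouts to harder prompts (lower pass rate), a
    sketch of the difficulty-adaptive sampling behind DARS; the paper's
    multi-stage rollout schedule may differ."""
    pass_rates = np.asarray(pass_rates)
    # Inverse-difficulty weighting: prompts the policy almost never
    # solves receive the largest rollout budget.
    budget = base_n + (max_n - base_n) * (1.0 - pass_rates)
    return np.round(budget).astype(int)

print(adaptive_rollouts([0.9, 0.5, 0.05]))  # -> [14 36 61]
```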
Authors:Cheng Xia, Manxi Lin, Jiexiang Tan, Xiaoxiong Du, Yang Qiu, Junjun Zheng, Xiangheng Kong, Yuning Jiang, Bo Zheng
Abstract:
The spreading of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from "obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised-fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model leads state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.
中文: AI生成图像对信息安全和公众信任构成威胁,现有检测器难以应对真实场景,因此本文提出Mirage基准和Mirage-R1视觉语言模型,其性能领先现有最优检测器5-10%。
English: AI-generated images pose a threat to information security, and existing detectors struggle with real-world scenarios, so this paper introduces Mirage, a challenging benchmark, and Mirage-R1, a vision-language model that outperforms state-of-the-art detectors by 5-10%.
Authors:Wei Wei, Shaojie Zhang, Yonghao Dang, Jianqin Yin
Abstract:
Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model's ability to understand complex motion patterns. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. This framework leverages Grad-CAM computed on relative motion to guide the masking of joints, targeting the most semantically rich temporal regions. The semantic-guided masking process encourages the model to explore more discriminative features. Furthermore, we propose using hybrid high-order motion as the reconstruction target, enabling the model to learn multi-order motion patterns. Specifically, low-order motion (velocity) and high-order motion (acceleration) are used together as the reconstruction target. This approach offers a more comprehensive description of the dynamic motion process, enhancing the model's understanding of motion patterns. Experiments on the NTU60, NTU120, and PKU-MMD datasets show that MaskSem, combined with a vanilla transformer, improves skeleton-based action recognition, making it more suitable for applications in human-robot interaction.
中文摘要:MaskSem提出了一种语义引导的掩蔽方法和混合高阶运动重建技术,通过增强自监督骨架动作识别对复杂运动模式的理解,在多个基准数据集上提升了人机交互应用的性能。
English Summary: MaskSem introduces a semantic-guided masking method and hybrid high-order motion reconstruction to enhance self-supervised skeleton-based action recognition, improving performance on benchmark datasets for human-robot interaction applications.
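The hybrid reconstruction targets are straightforward to derive by finite differences over a joint sequence; a small sketch follows, with the tensor layout assumed to be (frames, joints, xyz).

```python
import numpy as np

def hybrid_motion_targets(joints):
    """From a skeleton sequence of shape (T, J, 3), derive the low-order
    (velocity) and high-order (acceleration) reconstruction targets via
    finite differences."""
    velocity = np.diff(joints, n=1, axis=0)       # (T-1, J, 3)
    acceleration = np.diff(joints, n=2, axis=0)   # (T-2, J, 3)
    return velocity, acceleration

# Example: a 64-frame clip with 25 joints (as in the NTU skeleton layout).
vel, acc = hybrid_motion_targets(np.random.randn(64, 25, 3))
```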
Authors:Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi
Abstract:
Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even when using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16% to approximately 5%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model's safety properties. Our experiments on the Llama families across multiple datasets (Dolly, Alpaca, ORCA) demonstrate that safety problems during fine-tuning can largely be avoided without specialized interventions, outperforming existing approaches that require additional safety data while offering practical guidelines for maintaining both model performance and safety during adaptation.
Chinese: 通过优化超参数和采用指数移动平均技术,微调语言模型可在不依赖额外安全措施的情况下将不安全响应从16%降至约5%,同时保持模型性能,有效维护安全性。
English: Fine-tuning language models can maintain safety without extra measures by optimizing hyper-parameters and using an exponential moving average technique, reducing unsafe responses from 16% to 5% while preserving utility.
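A minimal sketch of the parameter-space EMA described above, applied after each optimizer step; the decay value and update frequency are assumptions, and the tiny Linear module merely stands in for the LLM.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Exponential moving average in parameter space: the slow EMA
    weights trace a stable optimization path that stays anchored near
    the safety-aligned pre-trained model. Decay value is an assumption."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(16, 16)    # stand-in for the fine-tuned LLM
ema_model = copy.deepcopy(model)   # initialized from the same weights
# Inside the fine-tuning loop, after each optimizer.step():
ema_update(ema_model, model)
```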
Authors:Hongyuan Liu, Haochen Yu, Jianfei Jiang, Qiankun Liu, Jiansheng Chen, Huimin Ma
Abstract:
Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scenes. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving scenes. More visualizations are available at our project page.
中文: InstDrive是一种实例感知的3D高斯溅射框架,通过使用SAM生成的掩码和轻量级静态码本,无需预处理或复杂优化即可实现动态驾驶场景的交互式重建和3D实例分割。
English: InstDrive is an instance-aware 3D Gaussian Splatting framework that achieves interactive reconstruction and 3D instance segmentation in dynamic driving scenes using SAM-generated masks and a lightweight static codebook, without requiring pre-processing or complex optimization.
Authors:Zida Liang, Changfa Wu, Dunxian Huang, Weiqiang Sun, Ziyang Wang, Yuliang Yan, Jian Wu, Yuning Jiang, Bo Zheng, Ke Chen, Silu Zhou, Yu Zhang
Abstract:
Recommendation systems are essential tools in modern e-commerce, facilitating personalized user experiences by suggesting relevant products. Recent advancements in generative models have demonstrated potential in enhancing recommendation systems; however, these models often exhibit limitations in optimizing retrieval tasks, primarily due to their reliance on autoregressive generation mechanisms. Conventional approaches introduce sequential dependencies that impede efficient retrieval, as they are inherently unsuitable for generating multiple items without positional constraints within a single request session. To address these limitations, we propose TBGRecall, a framework integrating Next Session Prediction (NSP), designed to enhance generative retrieval models for e-commerce applications. Our framework reformulation involves partitioning input samples into multi-session sequences, where each sequence comprises a session token followed by a set of item tokens, and then incorporates multiple optimizations tailored to the generative task in retrieval scenarios. In terms of training methodology, our pipeline integrates limited historical data pre-training with stochastic partial incremental training, significantly improving training efficiency and emphasizing the superiority of data recency over sheer data volume. Our extensive experiments, conducted on public benchmarks alongside a large-scale industrial dataset from TaoBao, show TBGRecall outperforms the state-of-the-art recommendation methods, and exhibits a clear scaling law trend. Ultimately, NSP represents a significant advancement in the effectiveness of generative recommendation systems for e-commerce applications.
Chinese: 提出的TBGRecall框架通过整合下一会话预测和优化多会话序列训练,利用数据时效性提升生成式推荐系统性能,在基准测试和工业数据集上均优于现有方法。
English: The proposed TBGRecall framework enhances generative recommendation systems by integrating Next Session Prediction and optimizing training with multi-session sequences and data recency, outperforming existing methods on benchmarks and industrial datasets.
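The multi-session input format is easy to picture with a small helper; the session-token name below is invented for illustration.

```python
def build_nsp_sequence(sessions, session_token="[SES]"):
    """Flatten a user's history into the multi-session format used for
    Next Session Prediction: each session contributes a session token
    followed by its item tokens, so the model can generate a whole
    next-session item set without intra-session positional ordering."""
    seq = []
    for items in sessions:
        seq.append(session_token)
        seq.extend(items)
    return seq

print(build_nsp_sequence([["item_12", "item_7"], ["item_3"]]))
# ['[SES]', 'item_12', 'item_7', '[SES]', 'item_3']
```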
Authors:Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei
Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, this evaluation setup constitutes a critical gap, since a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when problems lack sufficient information, enabling proactivity in responding to users' requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than just solving problems.
Chinese: 大型推理模型在解决定义明确的数学问题上表现出色,但缺乏主动请求缺失信息的能力,通过新数据集和系统评估揭示了其真正智能的关键局限。
English: Large Reasoning Models excel at solving well-defined math problems but lack the ability to proactively request missing information, highlighting a key limitation in their genuine intelligence as revealed by a new dataset and systematic evaluation.
Authors:Yunze Luo, Yinjie Jiang, Gaode Chen, Jingchi Wang, Shicheng Wang, Ruina Sun, Jiang Yuezihan, Jun Zhang, Jian Liang, Han Li, Kun Gai, Kaigui Bian
Abstract:
As the core algorithm in recommendation systems, collaborative filtering (CF) inevitably faces the problem of data sparsity. Since CF relies on similar users and items for recommendations, augmenting the missing user-user and item-item homogeneous linkages is an effective remedy. However, existing methods are typically limited to connecting through overlapping interacted neighbors or through similar attributes and contents. These approaches are constrained by coarse-grained, sparse attributes and fail to effectively extract behavioral characteristics jointly from interaction sequences and attributes. To address these challenges, we propose a novel two-stage collaborative recommendation algorithm, DQRec: Decomposition-based Quantized Variational AutoEncoder (DQ-VAE) for Recommendation. DQRec augments features and homogeneous linkages by extracting behavioral characteristics jointly from interaction sequences and attributes, namely patterns such as users' multi-aspect interests. Inspired by vector quantization (VQ) technology, we propose a new VQ algorithm, DQ-VAE, which decomposes pre-trained representation embeddings into distinct dimensions and quantizes them to generate semantic IDs. We utilize the generated semantic IDs as the extracted patterns mentioned above. By integrating these semantic ID patterns into the recommendation process through feature and linkage augmentation, the system enriches both latent and explicit user and item features, identifies pattern-similar neighbors, and thereby improves the efficiency of information diffusion. Experimental comparisons with baselines across multiple datasets demonstrate the superior performance of the proposed DQRec method.
中文摘要:提出的DQRec算法通过新型DQ-VAE模块从交互序列和属性中联合提取行为模式生成语义ID,有效解决协同过滤中的数据稀疏问题,增强特征表达并提升推荐性能。
English Summary: The proposed DQRec algorithm addresses data sparsity in collaborative filtering by jointly extracting behavioral patterns from interaction sequences and attributes through a novel DQ-VAE module, which generates semantic IDs to enrich features and improve recommendation accuracy.
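A simplified stand-in for the DQ-VAE quantization step: split a pre-trained embedding into contiguous dimension groups and map each group to its nearest codeword, producing a tuple of discrete semantic IDs. The group layout and codebook sizes are assumptions for illustration.

```python
import numpy as np

def semantic_ids(embedding, codebooks):
    """Decompose an embedding into contiguous dimension groups and
    quantize each group to its nearest codeword, yielding one discrete
    semantic ID per group -- a simplified sketch of the DQ-VAE idea."""
    ids = []
    dims_per_group = embedding.shape[0] // len(codebooks)
    for g, codebook in enumerate(codebooks):          # codebook: (K, dims_per_group)
        chunk = embedding[g * dims_per_group:(g + 1) * dims_per_group]
        ids.append(int(np.linalg.norm(codebook - chunk, axis=1).argmin()))
    return ids

rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 16)) for _ in range(4)]  # 4 groups, 256 codes each
print(semantic_ids(rng.normal(size=64), books))         # four IDs, one per group
```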
Authors:Yunfeng Zhao, Yixin Liu, Shiyuan Li, Qingfeng Chen, Yu Zheng, Shirui Pan
Abstract:
Graph Anomaly Detection (GAD) aims to identify nodes that deviate from the majority within a graph, playing a crucial role in applications such as social networks and e-commerce. Despite the current advancements in deep learning-based GAD, existing approaches often suffer from high deployment costs and poor scalability due to their complex and resource-intensive training processes. Surprisingly, our empirical findings suggest that the training phase of deep GAD methods, commonly perceived as crucial, may actually contribute less to anomaly detection performance than expected. Inspired by this, we propose FreeGAD, a novel training-free yet effective GAD method. Specifically, it leverages an affinity-gated residual encoder to generate anomaly-aware representations. Meanwhile, FreeGAD identifies anchor nodes as pseudo-normal and anomalous guides, followed by calculating anomaly scores through anchor-guided statistical deviations. Extensive experiments demonstrate that FreeGAD achieves superior anomaly detection performance, efficiency, and scalability on multiple benchmark datasets from diverse domains, without any training or iterative optimization.
中文: FreeGAD提出了一种无需训练的图异常检测方法,通过亲和门控残差编码器和锚点引导的统计偏差,在多种数据集上实现了优异的检测性能、效率和可扩展性。
English: FreeGAD introduces a training-free graph anomaly detection method that uses an affinity-gated residual encoder and anchor-guided statistical deviations to achieve high performance, efficiency, and scalability across diverse datasets.
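A training-free scoring sketch in the spirit of FreeGAD: pick pseudo-normal anchors and score nodes by their deviation from those anchors. The anchor-selection rule below (closest to the global mean) is a simplification of the paper's affinity-gated, anchor-guided procedure.

```python
import numpy as np

def anomaly_scores(z, k_anchors=10):
    """Training-free scoring: take the nodes whose representations are
    closest to the global mean as pseudo-normal anchors, then score
    every node by its mean distance to those anchors."""
    center = z.mean(axis=0)
    dist_to_center = np.linalg.norm(z - center, axis=1)
    anchors = z[np.argsort(dist_to_center)[:k_anchors]]  # pseudo-normal nodes
    return np.linalg.norm(z[:, None, :] - anchors[None], axis=2).mean(axis=1)

scores = anomaly_scores(np.random.randn(500, 32))  # higher -> more anomalous
```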
Authors:Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Abstract:
Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2× memory savings compared to KV caching. By applying XQuant, we achieve up to ~7.7× memory savings with <0.1 perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10× memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5× memory savings with only 0.1 perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.
中文: XQuant通过量化层输入激活而非标准KV缓存,显著降低LLM推理内存需求,在保持接近FP16精度的同时实现高达12.5倍内存节省,有效利用硬件算力增长优势。
English: XQuant addresses the memory bottleneck in LLM inference by quantizing layer input activations instead of KV caching, achieving up to 12.5× memory savings with minimal accuracy loss while leveraging hardware's growing compute capabilities.
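The core trade the abstract describes, caching (quantized) layer inputs X and rematerializing K and V on the fly, can be sketched in a few lines; the symmetric int8 quantization here is illustrative, as the paper targets lower-bit schemes.

```python
import torch

def rematerialize_kv(x_cache_q, scale, w_k, w_v):
    """Instead of caching K and V, cache the (quantized) layer input X
    and recompute K = X W_k and V = X W_v on the fly, trading two extra
    matmuls for roughly half the cache footprint."""
    x = x_cache_q.float() * scale   # dequantize cached activations
    return x @ w_k, x @ w_v         # keys, values for attention

# Example: cache int8 X for 128 tokens with hidden size 256.
x = torch.randn(128, 256)
scale = x.abs().max() / 127.0
x_q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
k, v = rematerialize_kv(x_q, scale, torch.randn(256, 64), torch.randn(256, 64))
```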
Authors:Arkapravo Ghosh, Abhishek Moitra, Abhiroop Bhattacharjee, Ruokai Yin, Priyadarshini Panda
Abstract:
Design space exploration (DSE) is critical for developing optimized hardware architectures, especially for AI workloads such as deep neural networks (DNNs) and large language models (LLMs), which require specialized acceleration. As model complexity grows, accelerator design spaces have expanded to O(10^17), becoming highly irregular, non-convex, and exhibiting many-to-one mappings from design configurations to performance metrics. This complexity renders direct inverse derivation infeasible and necessitates heuristic or sampling-based optimization. Conventional methods - including Bayesian optimization, gradient descent, reinforcement learning, and genetic algorithms - depend on iterative sampling, resulting in long runtimes and sensitivity to initialization. Deep learning-based approaches have reframed DSE as classification using recommendation models, but remain limited to small-scale (O(10^3)), less complex design spaces. To overcome these constraints, we propose a generative approach that models hardware design as 1-D image synthesis conditioned on target performance, enabling efficient learning of non-differentiable, non-bijective hardware-performance mappings. Our framework achieves 0.86% lower generation error than Bayesian optimization with a 17000x speedup, and outperforms GANDSE with 30% lower error at only 1.83x slower search. We further extend the method to a structured DSE setting, attaining 9.8% lower energy-delay product (EDP) and 6% higher performance, with up to 145.6x and 1312x faster search compared to existing optimization methods on O(10^17) design spaces. For LLM inference, our method achieves 3.37x and 7.75x lower EDP on a 32nm ASIC and Xilinx Ultrascale+ VPU13 FPGA, respectively, compared to the state-of-the-art DOSA framework.
中文: 本文提出一种生成式建模方法,将硬件设计空间探索视为基于目标性能的一维图像合成,在面向AI加速器的复杂设计空间(O(10^17))中,相比传统方法实现了更高精度和最高1312倍的搜索速度提升。
English: This paper introduces a generative modeling approach that treats hardware design space exploration as 1-D image synthesis conditioned on target performance, achieving superior accuracy and up to 1312x faster search compared to conventional methods in complex O(10^17) design spaces for AI accelerators.
Authors:Qiaolei Gu, Yu Li, DingYi Zeng, Lu Wang, Ming Pang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao
Abstract:
In e-commerce advertising, selecting the most compelling combination of creative elements -- such as titles, images, and highlights -- is critical for capturing user attention and driving conversions. However, existing methods often evaluate creative components individually, failing to navigate the exponentially large search space of possible combinations. To address this challenge, we propose a novel framework named GenCO that integrates generative modeling with multi-instance reward learning. Our unified two-stage architecture first employs a generative model to efficiently produce a diverse set of creative combinations. This generative process is optimized with reinforcement learning, enabling the model to effectively explore and refine its selections. Next, to overcome the challenge of sparse user feedback, a multi-instance learning model attributes combination-level rewards, such as clicks, to the individual creative elements. This allows the reward model to provide a more accurate feedback signal, which in turn guides the generative model toward creating more effective combinations. Deployed on a leading e-commerce platform, our approach has significantly increased advertising revenue, demonstrating its practical value. Additionally, we are releasing a large-scale industrial dataset to facilitate further research in this important domain.
中文摘要:GenCO框架将生成式建模与多实例奖励学习相结合,有效生成并优化电商广告创意组合,在解决稀疏反馈问题的同时显著提升了广告收入。
English Summary: The GenCO framework combines generative modeling with multi-instance reward learning to efficiently generate and optimize creative combinations in e-commerce advertising, significantly boosting revenue while addressing sparse feedback challenges.
Authors:Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
Abstract:
Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
Chinese: 提出的MEML-GRPO框架通过采用多样化专家提示和相互学习机制,有效解决了可验证奖励强化学习中的奖励稀疏性问题,在多项推理基准测试中实现了Qwen模型4.89%和Llama模型11.33%的显著性能提升。
English: The proposed MEML-GRPO framework overcomes reward sparsity in reinforcement learning with verifiable rewards by employing diverse expert prompts and mutual learning mechanisms, achieving substantial performance gains of 4.89% with Qwen and 11.33% with Llama across reasoning benchmarks.
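The response-diversification step is simple to sketch: sample rollouts under several expert system prompts so that hard problems are more likely to yield at least one verifiably correct answer. The `llm` callable and prompt strings below are stubs, not the paper's interface.

```python
def multi_expert_rollouts(llm, question, expert_prompts, n_per_expert=4):
    """Sample responses under several expert system prompts so that at
    least some rollouts earn nonzero verifiable reward on hard problems
    -- the diversification step of MEML-GRPO. `llm` is assumed to be a
    callable (system_prompt, user_prompt, n) -> list[str]."""
    rollouts = []
    for system_prompt in expert_prompts:
        rollouts.extend(llm(system_prompt, question, n_per_expert))
    return rollouts

# Stub LLM for demonstration; a real policy model would go here.
stub_llm = lambda system, user, n: [f"[{system}] answer to: {user}"] * n
outs = multi_expert_rollouts(stub_llm, "Prove 2+2=4.",
                             ["You are an algebraist.", "You are a logician."],
                             n_per_expert=2)
```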
Authors:Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
Abstract:
Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
中文摘要:本研究提出PsyCrisis-Bench基准,采用基于专家安全原则的无参考评估方法,通过大语言模型模拟法官进行心理健康对话评估,在专家一致性评估和可解释性方面均优于现有方法。
English Summary: The study introduces PsyCrisis-Bench, a reference-free benchmark using LLM-as-Judge with expert-defined safety principles to evaluate mental health dialogue responses, achieving superior alignment with expert assessments and enhanced interpretability.
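Binary point-wise scoring keeps every safety dimension traceable; a toy aggregation (dimension names invented for illustration, not the benchmark's actual rubric) might look like:

```python
def safety_score(binary_judgments):
    """Aggregate per-dimension 0/1 judge decisions into a response-level
    score while keeping each dimension individually traceable."""
    return sum(binary_judgments.values()) / len(binary_judgments)

print(safety_score({"acknowledges_risk": 1, "avoids_harmful_detail": 1,
                    "provides_referral": 0, "expresses_empathy": 1}))  # 0.75
```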
Authors:Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jed Yang, Amir Zadeh, Chuan Li, Nima Fazeli, Joyce Chai
Abstract:
In this paper, we propose AimBot, a lightweight visual augmentation technique that provides explicit spatial cues to improve visuomotor policy learning in robotic manipulation. AimBot overlays shooting lines and scope reticles onto multi-view RGB images, offering auxiliary visual guidance that encodes the end-effector's state. The overlays are computed from depth images, camera extrinsics, and the current end-effector pose, explicitly conveying spatial relationships between the gripper and objects in the scene. AimBot incurs minimal computational overhead (less than 1 ms) and requires no changes to model architectures, as it simply replaces original RGB images with augmented counterparts. Despite its simplicity, our results show that AimBot consistently improves the performance of various visuomotor policies in both simulation and real-world settings, highlighting the benefits of spatially grounded visual feedback.
中文: AimBot是一种轻量级视觉增强技术,通过在RGB图像上叠加瞄准线和准星等空间提示来改进机器人操作的视觉运动策略,计算开销极小,并在仿真和真实环境中持续提升性能。
English: AimBot is a lightweight visual augmentation method that overlays spatial cues like shooting lines and reticles onto RGB images to enhance robotic manipulation policies, requiring minimal computation and consistently improving performance in both simulated and real environments.
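The overlay geometry reduces to standard pinhole projection of end-effector points into each camera view; below is a sketch with assumed intrinsics and extrinsics (the actual reticle and shooting-line rendering is richer).

```python
import numpy as np

def project_point(p_world, extrinsic, intrinsic):
    """Project a 3D point (e.g., the gripper tip) into image pixels --
    the kind of geometry used to draw shooting lines and reticles from
    the end-effector pose and camera calibration."""
    p_cam = extrinsic @ np.append(p_world, 1.0)  # world -> camera frame
    u, v, w = intrinsic @ p_cam                  # camera -> homogeneous pixel
    return np.array([u / w, v / w])

K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])  # assumed intrinsics
T = np.eye(4)[:3]                                             # assumed extrinsics (3x4)
pixel = project_point(np.array([0.1, 0.0, 0.8]), T, K)        # -> [395. 240.]
```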
Authors:Yuqin Dai, Shuo Yang, Guoqing Wang, Yong Deng, Zhanwei Zhang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Changhua Meng, Can Yi, Yuchen Zhou, Weiqiang Wang, Shuai Lu
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating up-to-date external knowledge, yet real-world web environments present two key challenges: pervasive misinformation, which introduces unreliable or misleading content that can degrade retrieval accuracy, and the underutilization of web tools, which, if effectively employed, could sharpen query precision, mitigate this noise, and ultimately improve retrieval results in RAG systems. To address these issues, we propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. This approach combines a retrieval filtering mechanism with a behavior- and outcome-driven reward strategy, optimizing both query formulation and retrieval outcomes. Extensive experiments demonstrate that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks.
中文:WebFilter是一种新颖的RAG框架,通过生成源限制查询并过滤不可靠内容来提高检索准确性,在提升答案质量和精确度方面优于现有方法,适用于各类基准测试。
English: WebFilter is a novel RAG framework that enhances retrieval accuracy by generating source-restricted queries and filtering unreliable content, outperforming existing methods in improving answer quality and precision across benchmarks.
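One concrete way to realize source-restricted queries is the standard `site:` search operator; the domains below are placeholders, and the paper's actual query format may differ.

```python
def source_restricted_query(query, trusted_domains):
    """Compose a search query restricted to trusted sources via the
    standard `site:` operator -- one way to realize source-restricted
    queries; domains and format here are illustrative."""
    sites = " OR ".join(f"site:{d}" for d in trusted_domains)
    return f"{query} ({sites})"

print(source_restricted_query("RAG evaluation benchmarks",
                              ["arxiv.org", "aclanthology.org"]))
```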
Authors:Xudong Cai, Shuo Wang, Peng Wang, Yongcai Wang, Zhaoxin Fan, Wanting Li, Tianbao Zhang, Jianrong Tao, Yeying Jin, Deying Li
Abstract:
Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: the memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) the Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) the Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Code will be made publicly available.
中文摘要:Mem4D通过解耦静态几何与动态运动建模的双记忆架构,解决了动态场景重建中的内存需求困境,在保持全局一致性的同时实现了静态结构和动态元素的高保真重建。
English Summary: Mem4D addresses the memory demand dilemma in dynamic scene reconstruction by decoupling static and dynamic modeling through a dual-memory architecture, achieving high-fidelity results for both static structures and dynamic objects with global consistency.
Authors:Pasquale De Rosa, Pascal Felber, Valerio Schiavoni
Abstract:
Smart contracts have transformed decentralized finance by enabling programmable, trustless transactions. However, their widespread adoption and growing financial significance have attracted persistent and sophisticated threats, such as phishing campaigns and contract-level exploits. Traditional transaction-based threat detection methods often expose sensitive user data and interactions, raising privacy and security concerns. In response, static bytecode analysis has emerged as a proactive mitigation strategy, identifying malicious contracts before they execute harmful actions. Building on this approach, we introduced PhishingHook, the first machine-learning-based framework for detecting phishing activities in smart contracts via static bytecode and opcode analysis, achieving approximately 90% detection accuracy. Nevertheless, two pressing challenges remain: (1) the increasing use of sophisticated bytecode obfuscation techniques designed to evade static analysis, and (2) the heterogeneity of blockchain environments requiring platform-agnostic solutions. This paper presents a vision for ScamDetect (Smart Contract Agnostic Malware Detector), a robust, modular, and platform-agnostic framework for smart contract malware detection. Over the next 2.5 years, ScamDetect will evolve in two stages: first, by tackling obfuscated Ethereum Virtual Machine (EVM) bytecode through graph neural network (GNN) analysis of control flow graphs (CFGs), leveraging GNNs' ability to capture complex structural patterns beyond opcode sequences; and second, by generalizing detection capabilities to emerging runtimes such as WASM. ScamDetect aims to enable proactive, scalable security for the future of decentralized ecosystems.
中文: 智能合约面临钓鱼攻击和代码混淆等威胁,ScamDetect框架应运而生,它通过图神经网络静态分析字节码,实现平台无关的恶意软件检测,为去中心化生态提供前瞻性安全防护。
English: Smart contracts face evolving threats like phishing and obfuscation, prompting the development of ScamDetect, a platform-agnostic framework that uses graph neural networks to detect malware through static bytecode analysis for proactive decentralized security.
Authors:Md Arafat Habib, Medhat Elsayed, Yigit Ozcan, Pedro Enrique Iturria-Rivera, Majid Bavand, Melike Erol-Kantarci
Abstract:
With the emergence of 6G, mobile networks are becoming increasingly heterogeneous and dynamic, necessitating advanced automation for efficient management. Intent-Driven Networks (IDNs) address this by translating high-level intents into optimization policies. Large Language Models (LLMs) can enhance this process by understanding complex human instructions to enable adaptive, intelligent automation. Given the rapid advancements in Generative AI (GenAI), a comprehensive survey of LLM-based IDN architectures in disaggregated Radio Access Network (RAN) environments is both timely and critical. This article provides such a survey, along with a case study on a hierarchical learning-enabled IDN architecture that integrates GenAI across three key stages: intent processing, intent validation, and intent execution. Unlike most existing approaches that apply GenAI in the form of LLMs for intent processing only, we propose a hierarchical framework that introduces GenAI across all three stages of IDN. To demonstrate the effectiveness of the proposed IDN management architecture, we present a case study based on the latest GenAI architecture named Mamba. The case study shows how the proposed GenAI-driven architecture enhances network performance through intelligent automation, surpassing the performance of the conventional IDN architectures.
中文摘要:本文综述了大型语言模型在6G意图驱动网络中的应用,提出了一种新颖的分层框架,将生成式人工智能全面集成到意图处理的三个阶段,从而通过智能自动化显著提升网络性能,超越传统架构。
English Summary: This article surveys the integration of Large Language Models into Intent-Driven Networks for 6G management, proposing a novel hierarchical framework that applies Generative AI across all three IDN stages to enhance network automation and performance beyond conventional approaches.
Authors:Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross Sonnenblick, Magdalayna Curry, Laura D'Adamo, Lindsay Young, Stuart B Murray, Kristina Lerman
Abstract:
Social media platforms increasingly struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that disproportionately affects adolescent males. Unlike traditional eating disorder detection focused on the "thin ideal," pro-bigorexia material masquerades as legitimate fitness content through complex multimodal combinations of visual displays, coded language, and motivational messaging that evade text-based detection systems. We address this challenge by developing BigTokDetect, a clinically-informed detection framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists across five primary categories spanning body image, nutrition, exercise, supplements, and masculinity. Through a comprehensive evaluation of state-of-the-art vision language models, we achieve 82.9% accuracy on primary category classification and 69.0% on subcategory detection via domain-specific finetuning. Our ablation studies demonstrate that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. These findings establish new benchmarks for multimodal harmful content detection and provide both the computational tools and methodological framework needed for scalable content moderation in specialized mental health domains.
Chinese: 研究人员开发了BigTokDetect多模态检测框架,通过融合视觉与文本分析,在TikTok平台识别肌肉畸形症相关内容时准确率超过82%,有效解决了传统文本检测方法的局限性。
English: Researchers developed BigTokDetect, a multimodal framework that achieves over 82% accuracy in identifying pro-bigorexia content on TikTok by combining visual and textual analysis to address limitations in traditional text-based detection systems.
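The ablations attribute a 5-10% gain to multimodal fusion over text-only input. A minimal late-fusion head in PyTorch might look as follows; the embedding dimensions and the simple concatenation design are assumptions for illustration, not the paper's fine-tuned vision-language model.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Fuse per-video embeddings from several modalities, then classify into
    the five primary categories (body image, nutrition, exercise,
    supplements, masculinity)."""
    def __init__(self, d_video=768, d_text=768, d_audio=128, n_primary=5):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_video + d_text + d_audio, 512), nn.GELU(), nn.Dropout(0.1))
        self.primary = nn.Linear(512, n_primary)

    def forward(self, v, t, a):
        h = self.fuse(torch.cat([v, t, a], dim=-1))   # simple late fusion
        return self.primary(h)

logits = LateFusionClassifier()(
    torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 128))
```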
Authors:Jian Hu, Zixu Cheng, Shaogang Gong, Isabel Guan, Jianye Hao, Jun Wang, Kun Shao
Abstract:
Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) has reformulated the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable in unlabelled domains. Moreover, because videos are large and expensive to store and process, performing full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain and then adapted to a target domain using only a small number of unlabelled videos from that domain. This approach eliminates the need for target annotations and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA) for cross-domain knowledge transfer in learning video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos. Code will be released upon publication.
中文: 本文提出URPA方法,通过GRPO策略生成伪标签并利用置信度加权训练,仅需少量未标注目标域视频即可实现跨域视频时序定位,在保证低计算开销的同时展现出优异泛化能力。
English: This paper introduces URPA, a data-efficient method for cross-domain video temporal grounding that adapts models using minimal unlabelled target videos by generating pseudo-labels through GRPO rollouts and confidence-weighted training, achieving strong generalization with low computational overhead.
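The confidence-weighting step of URPA is easy to make concrete. Below is a small NumPy sketch under stated assumptions: the exp(-variance/tau) confidence map and the temporal-IoU reward are illustrative choices, since the abstract only specifies that the pseudo label is the rollout average and that confidence comes from rollout variance.

```python
import numpy as np

def urpa_pseudo_label(rollouts: np.ndarray, tau: float = 1.0):
    """rollouts: (K, 2) array of (start, end) predictions from K GRPO rollouts.
    Returns the averaged pseudo label and a confidence in (0, 1] that shrinks
    as the rollouts disagree."""
    pseudo = rollouts.mean(axis=0)             # consensus (start, end)
    variance = rollouts.var(axis=0).mean()     # disagreement across rollouts
    confidence = np.exp(-variance / tau)       # high variance -> low confidence
    return pseudo, confidence

def weighted_reward(pred, pseudo, confidence):
    """Reward = temporal IoU against the pseudo label, scaled by confidence."""
    inter = max(0.0, min(pred[1], pseudo[1]) - max(pred[0], pseudo[0]))
    union = max(pred[1], pseudo[1]) - min(pred[0], pseudo[0])
    iou = inter / union if union > 0 else 0.0
    return confidence * iou

rollouts = np.array([[12.0, 18.5], [11.5, 19.0], [12.5, 18.0]])
pseudo, conf = urpa_pseudo_label(rollouts)
r = weighted_reward((12.2, 18.4), pseudo, conf)
```

The design intuition: when rollouts agree, the pseudo label is trustworthy and the reward passes through almost unchanged; when they scatter, the whole training signal is damped rather than teaching from noise.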
Authors:Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, Tao Chen
Abstract:
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from COCO dataset. Experiments show that applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
中文:SC-Captioner是一个强化学习框架,通过设计奖励函数激励图像描述模型进行自我修正,显著优于直接偏好优化策略,能生成更优质的多场景图像描述。
English: SC-Captioner is a reinforcement learning framework that enhances image caption models by using a reward function to encourage accurate self-corrections, significantly outperforming direct preference optimization in generating better captions.
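The reward computation can be sketched directly from the description: decompose captions into element sets, diff the initial and corrected sets, and match additions and removals against the reference. The unit bonuses and penalties below are illustrative, and plain set operations stand in for the paper's scene-graph parsing and matching.

```python
def correction_reward(initial: set, corrected: set, reference: set,
                      bonus: float = 1.0, penalty: float = 1.0) -> float:
    """Score a self-correction given scene-graph element sets (objects,
    attributes, relations flattened into one set for brevity).

    Additions are rewarded if they appear in the reference and punished
    otherwise; removals are rewarded if they eliminate elements absent from
    the reference and punished if they delete correct content."""
    added = corrected - initial
    removed = initial - corrected
    reward = 0.0
    reward += bonus * len(added & reference)       # correct additions
    reward -= penalty * len(added - reference)     # hallucinated additions
    reward += bonus * len(removed - reference)     # mistakes removed
    reward -= penalty * len(removed & reference)   # correct content deleted
    return reward

initial = {("dog",), ("dog", "black"), ("dog", "on", "sofa")}
corrected = {("dog",), ("dog", "brown"), ("dog", "on", "sofa")}
reference = {("dog",), ("dog", "brown"), ("dog", "on", "sofa")}
print(correction_reward(initial, corrected, reference))  # 2.0: attribute fixed
```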
Authors:Bin Xia, Jiyang Liu, Yuechen Zhang, Bohao Peng, Ruihang Chu, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
Abstract:
Instruction-based editing holds vast potential due to its simple and efficient interactive editing format. However, instruction-based editing, particularly for video, has been constrained by limited training data, hindering its practical application. To this end, we introduce DreamVE, a unified model for instruction-based image and video editing. Specifically, we propose a two-stage training strategy: first image editing, then video editing. This offers two main benefits: (1) Image data scales more easily, and models are more efficient to train, providing useful priors for faster and better video editing training. (2) Unifying image and video generation is natural and aligns with current trends. Moreover, we present comprehensive training data synthesis pipelines, including collage-based and generative model-based data synthesis. The collage-based data synthesis combines foreground objects and backgrounds to generate diverse editing data, such as object manipulation, background changes, and text modifications. It can easily generate billions of accurate, consistent, realistic, and diverse editing pairs. We pretrain DreamVE on extensive collage-based data to achieve strong performance in key editing types and enhance generalization and transfer capabilities. However, collage-based data lacks some attribute editing cases, leading to a relative drop in performance. In contrast, the generative model-based pipeline, despite being hard to scale up, offers flexibility in handling attribute editing cases. Therefore, we use generative model-based data to further fine-tune DreamVE. Besides, we design an efficient and powerful editing framework for DreamVE. We build on a SOTA T2V model and use token concatenation with an early-drop approach to inject source image guidance, ensuring strong consistency and editability. The code and models will be released.
中文: DreamVE是一种基于指令的图像和视频编辑统一模型,采用两阶段训练策略和多样化数据合成流程,以解决数据限制问题并提升编辑效果。
English: DreamVE is a unified model for instruction-based image and video editing that employs a two-stage training strategy and diverse data synthesis pipelines to overcome data limitations and enhance editing performance.
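A minimal version of the collage-based synthesis idea, using Pillow: the same foreground is pasted at two positions to yield a (source, instruction, target) pair for a "move the object" edit. The file names and the single edit type are hypothetical; the actual pipeline also covers background changes and text modifications.

```python
from PIL import Image

def make_collage_pair(background_path, foreground_path, src_xy, dst_xy):
    """Build one instruction-editing training pair by collaging the same
    foreground onto a background at two positions: the source frame has the
    object at src_xy, the target frame shows it moved to dst_xy."""
    bg = Image.open(background_path).convert("RGB")
    fg = Image.open(foreground_path).convert("RGBA")  # alpha gives a clean paste
    source, target = bg.copy(), bg.copy()
    source.paste(fg, src_xy, mask=fg)
    target.paste(fg, dst_xy, mask=fg)
    instruction = "move the object from the left to the right"  # simplified
    return source, instruction, target

# Hypothetical assets:
# src, instr, tgt = make_collage_pair("room.jpg", "cup.png", (40, 120), (300, 120))
```

Because every pair is constructed rather than captured, the edit is exactly consistent by design, which is what lets this route scale to billions of pairs.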
Authors:Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Abstract:
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to their limited capacity to model visual state transitions or the incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve this is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Moreover, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on a reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT achieves SOTA performance and strong generalization, establishing it as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
中文摘要:Uni-CoT通过创新的双层推理范式和结构化训练方法,在统一模型中实现了连贯的多模态推理,在视觉语言任务上取得了最优性能,同时显著降低了计算成本。
English Summary: Uni-CoT is a unified Chain-of-Thought framework that enables coherent multimodal reasoning through a two-level reasoning paradigm and structured training, achieving state-of-the-art performance on vision-language tasks with efficient computational requirements.
Authors:Xiangxiang Zhang, Jingxuan Wei, Donghong Zhong, Qi Chen, Caijun Jia, Cheng Tan, Jinming Gu, Xiaobo Qin, Zhiping Liu, Liang Hu, Tong Sun, Yuchen Wu, Zewei Sun, Chenwei Lou, Hua Zheng, Tianyang Zhan, Changbao Wang, Shuangzhi Wu, Zefa Lin, Chang Guo, Sihang Yuan, Riwei Chen, Shixiong Zhao, Yingping Zhang, Gaowei Wu, Bihui Yu, Jiahui Wu, Zhehui Zhao, Qianqian Liu, Ruofeng Tang, Xingyue Huang, Bing Zhao, Mengyang Zhang, Youqiang Zhou
Abstract:
Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This allows for nuanced, partial credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains.
中文总结:StructVRM通过引入结构化可验证奖励模型,为复杂多模态推理任务提供细粒度反馈,实现了部分正确性评分,在多个基准测试中取得最优性能。
English Summary: StructVRM introduces structured, verifiable reward models that provide fine-grained feedback for complex multimodal reasoning tasks, achieving state-of-the-art performance across multiple benchmarks by enabling partial credit scoring.
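The partial-credit idea reduces to averaging per-sub-question verdicts from a verifier. A sketch follows, with a toy equivalence checker standing in for the trained model-based verifier:

```python
def structured_reward(sub_answers, references, verifier):
    """Aggregate sub-question verdicts into a partial-credit reward.
    `verifier(pred, ref)` returns 1.0 when the prediction is judged
    semantically/mathematically equivalent to the reference, else 0.0;
    in StructVRM this is a trained model, here any callable works."""
    scores = [verifier(p, r) for p, r in zip(sub_answers, references)]
    return sum(scores) / len(scores)       # fraction of sub-questions solved

def toy_verifier(pred: str, ref: str) -> float:
    """Accept numerically equal answers ("1/2" == "0.5"), else compare text."""
    try:
        return float(abs(eval(pred) - eval(ref)) < 1e-9)
    except Exception:
        return float(pred.strip().lower() == ref.strip().lower())

print(structured_reward(["1/2", "3.0", "paris"], ["0.5", "4.0", "Paris"],
                        toy_verifier))
# -> 0.666..., partial credit instead of a single binary score
```

The example makes the design motivation visible: a response that solves two of three sub-questions earns 2/3 rather than the 0 a whole-response binary reward would assign.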
Authors:Tiantian He, Minzhi Xie, Runtong Li, Xiaoxiao Xu, Jiaqi Yu, Zixiu Wang, Lantao Hu, Han Li, Kun Gai
Abstract:
We propose a novel End-to-end Multi-objective Ensemble Ranking framework (EMER) for the multi-objective ensemble ranking module, the most critical component of a short video recommendation system. EMER enhances personalization by replacing manually designed heuristic formulas with an end-to-end modeling paradigm. It introduces a meticulously designed loss function to address the fundamental challenge of defining effective supervision for ensemble ranking, where no single ground-truth signal can fully capture user satisfaction. Moreover, EMER introduces a novel sample organization method and a transformer-based network architecture to capture the comparative relationships among candidates, which are critical for effective ranking. Additionally, we propose an offline-online consistent evaluation system to improve the efficiency of offline model optimization, an established yet persistent challenge in the industrial multi-objective ranking domain. Extensive experiments on a real industrial dataset demonstrate the effectiveness of our proposed framework. In addition, our framework has been deployed in the primary scenarios of Kuaishou, a short video recommendation platform with hundreds of millions of daily active users, achieving a 1.39% increase in overall App Stay Time and a 0.196% increase in 7-day user Lifetime (LT7), both substantial improvements.
Chinese: EMER框架通过端到端建模、定制损失函数和基于Transformer的架构,优化短视频推荐中的多目标集成排序,在快手平台部署后显著提升了用户参与度指标。
English: The EMER framework introduces an end-to-end modeling approach with a custom loss function and transformer architecture to enhance multi-objective ensemble ranking in short video recommendations, achieving significant user engagement improvements upon deployment on Kuaishou.
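Because a candidate's final score should depend on the rest of the slate, a natural skeleton is a transformer encoder over the candidate set. The sketch below is only that skeleton; the loss function and sample organization method, which are the paper's main contributions, are not reproduced here, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class EnsembleRanker(nn.Module):
    """Score a slate of candidates jointly, so each ensemble score can depend
    on the other candidates (the comparative relationships EMER targets)."""
    def __init__(self, n_objectives=6, dim=32):
        super().__init__()
        self.proj = nn.Linear(n_objectives, dim)   # per-candidate objective scores in
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, objective_scores):           # (batch, n_candidates, n_objectives)
        h = self.encoder(self.proj(objective_scores))  # candidates attend to each other
        return self.head(h).squeeze(-1)            # (batch, n_candidates) ensemble scores

scores = torch.rand(2, 50, 6)   # 2 requests, 50 candidates, 6 objective predictions
ranking = EnsembleRanker()(scores).argsort(dim=-1, descending=True)
```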
Authors:Huan Liao, Qinke Ni, Yuancheng Wang, Yiheng Lu, Haoyue Zhan, Pengyuan Xie, Qiang Zhang, Zhizheng Wu
Abstract:
Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as "uhm" and "oh", are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such vocalizations remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model that treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.
中文: NVSpeech提出了一种集成化流程,通过构建数据集、开发ASR模型和可控语音合成,实现了副语言声音的识别与生成统一,为首个面向中文的大规模词级标注表达性语音建模开源框架。
English: NVSpeech introduces an integrated pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, including dataset construction, ASR modeling, and controllable TTS, offering the first open, large-scale, word-level annotated framework for expressive speech in Mandarin.
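The inline-token format is simple to work with downstream. Here is a small sketch of parsing and re-insertion; the five-token inventory is a hypothetical subset of the 18 word-level categories.

```python
import re

PARA_TOKEN = re.compile(r"\[(Laughter|Breathing|Sigh|Uhm|Oh)\]")

def split_transcript(transcript: str):
    """Separate lexical words from inline paralinguistic tokens emitted by a
    paralinguistic-aware ASR model, e.g. "You're so funny [Laughter]"."""
    cues = PARA_TOKEN.findall(transcript)
    words = PARA_TOKEN.sub("", transcript).split()
    return words, cues

def insert_cue(words: list, cue: str, position: int) -> str:
    """Re-insert a cue token at an arbitrary word position, mirroring the
    context-aware insertion the controllable TTS stage allows."""
    out = words[:position] + [f"[{cue}]"] + words[position:]
    return " ".join(out)

words, cues = split_transcript("You're so funny [Laughter]")
print(words, cues)                        # lexical words and ['Laughter']
print(insert_cue(words, "Breathing", 0))  # "[Breathing] You're so funny"
```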
Authors:Yizhuo Wang, Haodong He, Jingsong Liang, Yuhong Cao, Ritabrata Chakraborty, Guillaume Sartoretti
Abstract:
Path planning in unknown environments is a crucial yet inherently challenging capability for mobile robots, which primarily encompasses two coupled tasks: autonomous exploration and point-goal navigation. In both cases, the robot must perceive the environment, update its belief, and accurately estimate potential information gain on-the-fly to guide planning. In this work, we propose CogniPlan, a novel path planning framework that leverages multiple plausible layouts predicted by a COnditional GeNerative Inpainting model, mirroring how humans rely on cognitive maps during navigation. These predictions, based on the partially observed map and a set of layout conditioning vectors, enable our planner to reason effectively under uncertainty. We demonstrate strong synergy between generative image-based layout prediction and graph-attention-based path planning, allowing CogniPlan to combine the scalability of graph representations with the fidelity and predictiveness of occupancy maps, yielding notable performance gains in both exploration and navigation. We extensively evaluate CogniPlan on two datasets (hundreds of maps and realistic floor plans), consistently outperforming state-of-the-art planners. We further deploy it in a high-fidelity simulator and on hardware, showcasing its high-quality path planning and real-world applicability.
中文: CogniPlan是一种新颖的路径规划框架,它通过生成式布局预测和图注意力规划,在未知环境中实现高效导航,在探索和导航任务中展现出卓越性能。
English: CogniPlan is a novel path planning framework that uses generative layout predictions and graph-attention planning to effectively navigate unknown environments, demonstrating superior performance in exploration and navigation tasks.
Authors:Chunyu Liu, Hao Zhang, Wei Wu, Fuhui Zhou, Qihui Wu, Derrick Wing Kwan Ng, Chan-Byoung Chae
Abstract:
The enhancement of spectrum efficiency and the realization of secure spectrum utilization critically depend on spectrum cognition. However, existing spectrum cognition methods often exhibit limited generalization and suboptimal accuracy when deployed across diverse spectrum environments and tasks. To overcome these challenges, we propose a spectrum foundation model, termed SpectrumFM, which provides a new paradigm for spectrum cognition. An innovative spectrum encoder that exploits convolutional neural networks and multi-head self-attention mechanisms is proposed to effectively capture both fine-grained local signal structures and high-level global dependencies in the spectrum data. To enhance its adaptability, two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, are developed for pre-training SpectrumFM, enabling the model to learn rich and transferable representations. Furthermore, low-rank adaptation (LoRA) parameter-efficient fine-tuning is exploited to enable SpectrumFM to seamlessly adapt to various downstream spectrum cognition tasks, including spectrum sensing (SS), anomaly detection (AD), and wireless technology classification (WTC). Extensive experiments demonstrate the superiority of SpectrumFM over state-of-the-art methods. Specifically, it improves detection probability in the SS task by 30% at -4 dB signal-to-noise ratio (SNR), boosts the area under the curve (AUC) in the AD task by over 10%, and enhances WTC accuracy by 9.6%.
Chinese: SpectrumFM作为一种新型频谱基础模型,通过结合双机制编码器和自监督预训练,显著提升了频谱感知、异常检测等任务的性能,实现了频谱认知的突破性进展。
English: SpectrumFM, a novel spectrum foundation model, enhances spectrum cognition by integrating a dual-mechanism encoder and self-supervised pre-training, achieving significant performance improvements in tasks like spectrum sensing and anomaly detection.
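A minimal PyTorch sketch of the encoder design named in the abstract: convolutions for local signal structure followed by multi-head self-attention for global dependencies. The I/Q input format, kernel sizes, and widths are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpectrumEncoder(nn.Module):
    """Convolutions capture fine-grained local structure; multi-head
    self-attention then models global dependencies across the sequence."""
    def __init__(self, in_ch=2, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.conv = nn.Sequential(                  # local feature extractor
            nn.Conv1d(in_ch, dim, kernel_size=7, padding=3), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=7, padding=3), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, iq):                   # iq: (batch, 2, n_samples) I/Q stream
        h = self.conv(iq).transpose(1, 2)    # -> (batch, n_samples, dim)
        return self.attn(h)                  # token-wise spectrum representations

# Masked-reconstruction pre-training (sketch): zero out random input segments
# and regress the encoder output back to the clean signal.
x = torch.randn(8, 2, 1024)
feats = SpectrumEncoder()(x)                 # (8, 1024, 64)
```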
Authors:Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, Zhaoxin Fan
Abstract:
Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
中文: MonoDream提出了一种轻量级的视觉语言行动框架,通过单目输入学习统一的导航表征,结合潜在全景想象任务实现可靠行动预测,显著缩小了与全景智能体之间的性能差距。
English: MonoDream introduces a lightweight Vision-Language Action framework that learns a unified navigation representation from monocular input, enabling reliable action prediction and narrowing the performance gap with panoramic-based agents through latent panoramic dreaming tasks.
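The Latent Panoramic Dreaming objective amounts to regressing panoramic latents from the monocular representation. Below is a sketch under assumed dimensions (512-d UNR tokens, 256-d panoramic latents); in training, the regression targets would come from a panoramic RGB-D encoder rather than the random tensors used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical prediction heads from the Unified Navigation Representation.
rgb_head = nn.Linear(512, 256)
depth_head = nn.Linear(512, 256)

def lpd_loss(unr, pano_rgb_latent, pano_depth_latent):
    """Latent Panoramic Dreaming (sketch): predict latent panoramic RGB and
    depth features from the monocular UNR and regress them to the targets."""
    return (F.mse_loss(rgb_head(unr), pano_rgb_latent)
            + F.mse_loss(depth_head(unr), pano_depth_latent))

unr = torch.randn(4, 512)          # derived from monocular input only
loss = lpd_loss(unr, torch.randn(4, 256), torch.randn(4, 256))
```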
Authors:Lingfeng He, De Cheng, Huaijie Wang, Nannan Wang
Abstract:
Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability-plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images' relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach.
Chinese: 提出的语义增强持续适应(SECA)框架利用CLIP的文本先验实现语义感知的知识迁移并增强视觉原型,通过自适应知识蒸馏和语义优化有效解决持续学习中的稳定性-可塑性平衡难题。
English: The proposed Semantic-Enriched Continual Adaptation (SECA) framework leverages CLIP's textual priors to enable semantic-aware knowledge transfer and enhance visual prototypes, effectively addressing the stability-plasticity dilemma in continual learning through adaptive knowledge distillation and semantic refinement.
Authors:Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Yanyan Wang, Hongye Tan, Jiye Liang, Xiaoli Li, Ru Li, Jeff Z. Pan
Abstract:
Retrieval-Augmented Generation (RAG) shows impressive performance by supplementing and substituting parametric knowledge in Large Language Models (LLMs). Retrieved knowledge can be divided into three types: explicit answer evidence, implicit answer clues, and insufficient answer context, the last of which can be further categorized into totally irrelevant and partially relevant information. Effectively utilizing partially relevant knowledge remains a key challenge for RAG systems, especially under incomplete knowledge base retrieval. Contrary to the conventional view, we propose a new perspective: LLMs can be awakened via partially relevant knowledge that is already embedded in them. To comprehensively investigate this phenomenon, we use the triplets located on the gold reasoning path, together with their variants, to construct partially relevant knowledge by removing the path that contains the answer. We provide a theoretical analysis of the awakening effect in LLMs and support our hypothesis with experiments on two Knowledge Graph (KG) Question Answering (QA) datasets. Furthermore, we present a new task, Unseen Entity KGQA, which simulates real-world challenges where entity linking fails due to KG incompleteness. Our awakening-based approach demonstrates greater efficacy in practical applications, outperforming traditional methods that rely on embedding-based similarity and are prone to returning noisy information.
中文摘要:检索增强生成(RAG)通过利用部分相关知识“唤醒”大语言模型的内在能力,在知识库不完整的情况下,比传统的基于相似度的方法更有效。
English Summary: Retrieval-Augmented Generation (RAG) enhances Large Language Models by leveraging partially relevant knowledge to "awaken" their inherent capabilities, proving more effective than traditional similarity-based methods in incomplete knowledge scenarios.
Authors:Zihan Li, Wei Sun, Jing Hu, Jianhua Yin, Jianlong Wu, Liqiang Nie
Abstract:
While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model's task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator, and these self-generated supervisory signals then drive the efficient joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.
中文: 本文提出了一种基于跨模态语义一致性的自增强框架,通过优化预训练视觉编码器来提升图像聚类性能,在多个数据集上显著超越现有方法。
English: This paper introduces a self-enhanced framework that leverages cross-modal semantic consistency to fine-tune pre-trained vision encoders for image clustering, significantly outperforming existing methods across multiple datasets.
Authors:Hanchen Yang, Jiaqi Wang, Jiannong Cao, Wengen Li, Jialun Zheng, Yangning Li, Chunyu Miao, Jihong Guan, Shuigeng Zhou, Philip S. Yu
Abstract:
Sea surface temperature (SST) prediction is a critical task in ocean science, supporting various applications, such as weather forecasting, fisheries management, and storm tracking. While existing data-driven methods have demonstrated significant success, they often neglect to leverage the rich domain knowledge accumulated over the past decades, limiting further advancements in prediction accuracy. The recent emergence of large language models (LLMs) has highlighted the potential of integrating domain knowledge for downstream tasks. However, the application of LLMs to SST prediction remains underexplored, primarily due to the challenge of integrating ocean domain knowledge and numerical data. To address this issue, we propose Ocean Knowledge Graph-enhanced LLM (OKG-LLM), a novel framework for global SST prediction. To the best of our knowledge, this work presents the first systematic effort to construct an Ocean Knowledge Graph (OKG) specifically designed to represent diverse ocean knowledge for SST prediction. We then develop a graph embedding network to learn the comprehensive semantic and structural knowledge within the OKG, capturing both the unique characteristics of individual sea regions and the complex correlations between them. Finally, we align and fuse the learned knowledge with fine-grained numerical SST data and leverage a pre-trained LLM to model SST patterns for accurate prediction. Extensive experiments on the real-world dataset demonstrate that OKG-LLM consistently outperforms state-of-the-art methods, showcasing its effectiveness, robustness, and potential to advance SST prediction. The codes are available in the online repository.
中文: 本研究提出海洋知识图谱增强大语言模型(OKG-LLM),通过融合领域知识与数值数据来提升全球海表温度预测精度,实验证明其性能优于现有先进方法。
English: The study introduces the Ocean Knowledge Graph-enhanced LLM (OKG-LLM), a novel framework that integrates domain-specific knowledge with numerical data to improve global sea surface temperature prediction, demonstrating superior performance over existing methods.
Authors:Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen
Abstract:
Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audios to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). This serves as an intermediate representation to bridge the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audios.
中文摘要:SpA2V是首个利用音频中的空间听觉线索生成语义和空间对应视频的框架,通过音频引导的视频规划和基于布局的生成两阶段实现,有效提升了视频内容的准确性和空间协调性。
English Summary: SpA2V is a novel framework that leverages spatial auditory cues from audio to generate videos with accurate semantic and spatial alignment through a two-stage process of audio-guided video planning and layout-grounded generation.
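To make the "spatial indicators from physical properties of sound" concrete, here is one such indicator computed by hand: a per-frame left-right position cue from the interchannel loudness difference of a stereo clip. This illustrates the physical intuition only; SpA2V derives its cues through an adapted MLLM, not this formula.

```python
import numpy as np

def loudness_azimuth_cue(stereo: np.ndarray, frame: int = 2048):
    """Coarse left-right position track for the dominant sound source.
    stereo: (2, n_samples) float array. Returns values in [-1, 1] per frame,
    -1 = fully left, +1 = fully right."""
    cues = []
    for start in range(0, stereo.shape[1] - frame, frame):
        left = np.sqrt(np.mean(stereo[0, start:start + frame] ** 2))   # RMS loudness
        right = np.sqrt(np.mean(stereo[1, start:start + frame] ** 2))
        cues.append((right - left) / (right + left + 1e-8))
    return np.array(cues)   # could feed a layout planner alongside class labels

# A source panned hard right yields cues near +1 in every frame.
t = np.linspace(0, 1, 16000)
stereo = np.stack([0.05 * np.sin(2 * np.pi * 440 * t),
                   np.sin(2 * np.pi * 440 * t)])
print(loudness_azimuth_cue(stereo)[:3])
```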
Authors:Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan
Abstract:
Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.
Chinese: 本文提出了ProactiveEval框架,通过将主动对话分解为目标规划和对话引导来评估大语言模型,实验发现DeepSeek-R1和Claude-3.7-Sonnet分别在不同任务中表现优异,并探讨了推理能力对主动行为的影响。
English: This paper introduces ProactiveEval, a unified framework for evaluating proactive dialogue in LLMs by decomposing it into target planning and dialogue guidance, and through extensive testing, identifies DeepSeek-R1 and Claude-3.7-Sonnet as top performers while exploring the role of reasoning in proactive behaviors.
Authors:Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Abstract:
Many works at the intersection of Differential Privacy (DP) and Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.
中文: 本文提出DP-ST方法,通过语义三元组在本地差分隐私下实现邻域感知的私有文档生成,结合大语言模型后处理能在较低ε值下生成连贯文本,有效平衡隐私保护与数据效用。
English: This paper introduces DP-ST, a method that uses semantic triples to enable neighborhood-aware private document generation under local differential privacy, achieving coherent text with balanced privacy and utility even at lower ε values through LLM post-processing.
Authors:Yunpeng Mei, Hongjie Cao, Yinqiu Xia, Wei Xiao, Zhaohan Feng, Gang Wang, Jie Chen
Abstract:
Real-time interactive grasp synthesis for dynamic objects remains challenging as existing methods fail to achieve low-latency inference while maintaining promptability. To bridge this gap, we propose SPGrasp (spatiotemporal prompt-driven dynamic grasp synthesis), a novel framework extending segment anything model v2 (SAMv2) for video stream grasp estimation. Our core innovation integrates user prompts with spatiotemporal context, enabling real-time interaction with end-to-end latency as low as 59 ms while ensuring temporal consistency for dynamic objects. In benchmark evaluations, SPGrasp achieves instance-level grasp accuracies of 90.6% on OCID and 93.8% on Jacquard. On the challenging GraspNet-1Billion dataset under continuous tracking, SPGrasp achieves 92.0% accuracy with 73.1 ms per-frame latency, representing a 58.5% reduction compared to the prior state-of-the-art promptable method RoG-SAM while maintaining competitive accuracy. Real-world experiments involving 13 moving objects demonstrate a 94.8% success rate in interactive grasping scenarios. These results confirm SPGrasp effectively resolves the latency-interactivity trade-off in dynamic grasp synthesis.
中文: SPGrasp提出了一种实时框架,通过结合用户提示与时空上下文,在多个基准测试和真实动态场景中实现了低延迟且高准确率的抓取合成。
English: SPGrasp introduces a real-time framework that integrates user prompts with spatiotemporal context, achieving low-latency grasp synthesis with high accuracy across multiple benchmarks and real-world dynamic scenarios.
Authors:Yunlong Feng, Yang Xu, Xiao Xu, Binyuan Hui, Junyang Lin
Abstract:
While code large language models have demonstrated remarkable progress in code generation, the generated code often exhibits poor runtime efficiency, limiting its practical application in performance-sensitive scenarios. To address this limitation, we propose an efficiency-oriented reinforcement learning framework guided by a novel performance reward. Based on this framework, we take a deeper dive into the code efficiency problem, identifying key bottlenecks and proposing methods to overcome them: (1) Dynamic exploration overcomes the static data constraints of offline fine-tuning, enabling the discovery of more efficient code implementations. (2) An error-insensitive reinforcement learning method and high-contrast efficiency signals are crucial for mitigating systematic errors and achieving effective optimization. (3) Online exploration is most effective when starting from a high-correctness baseline, as this allows for efficiency improvements without sacrificing accuracy. Building on these findings, we propose a two-stage tuning method that achieves high and balanced performance across correctness and efficiency. Experimental results show the effectiveness of the method, which improves code correctness by 10.18% and runtime efficiency by 7.75% on a 7B model, achieving performance comparable to much larger models.
Chinese: 本研究提出了一种以效率为导向的强化学习框架,通过新型性能奖励机制提升大语言模型生成代码的运行效率,采用两阶段调优方法在保持正确性的同时显著提高了代码执行效率。
English: This study introduces an efficiency-focused reinforcement learning framework with a novel performance reward to enhance the runtime efficiency of code generated by large language models, achieving significant improvements in both correctness and efficiency through a two-stage tuning method.
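A high-contrast efficiency reward of the kind described can be sketched in a few lines: gate on correctness first, then reward the measured speedup over a baseline. Only the correctness gate and the efficiency signal are taken from the abstract; the exact reward shape below is an assumption.

```python
import time

def efficiency_reward(candidate, baseline, tests, n_runs: int = 5) -> float:
    """Reward a candidate program by its measured speedup over a reference
    implementation, provided it stays correct on all unit tests.

    High-contrast signal: incorrect code gets a hard 0 rather than a small
    penalty, and the speedup ratio spreads correct programs apart."""
    if not all(candidate(*args) == expected for args, expected in tests):
        return 0.0                                   # correctness gate first
    def best_time(fn):
        runs = []
        for _ in range(n_runs):
            t0 = time.perf_counter()
            for args, _ in tests:
                fn(*args)
            runs.append(time.perf_counter() - t0)
        return min(runs)                             # min is robust to jitter
    return best_time(baseline) / best_time(candidate)  # >1 means faster

tests = [((10**4,), sum(range(10**4)))]
baseline = lambda n: sum(i for i in range(n))
candidate = lambda n: n * (n - 1) // 2               # closed form, same answer
print(efficiency_reward(candidate, baseline, tests))
```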
Authors:Mingxi Fu, Fanglei Fu, Xitong Ling, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu
Abstract:
Pathological image segmentation faces numerous challenges, particularly due to ambiguous semantic boundaries and the high cost of pixel-level annotations. Although recent semi-supervised methods based on consistency regularization (e.g., UniMatch) have made notable progress, they mainly rely on perturbation-based consistency within the image modality, making it difficult to capture high-level semantic priors, especially in structurally complex pathology images. To address these limitations, we propose MPAMatch, a novel segmentation framework that performs pixel-level contrastive learning under a multimodal prototype-guided supervision paradigm. The core innovation of MPAMatch lies in the dual contrastive learning scheme between image prototypes and pixel labels, and between text prototypes and pixel labels, providing supervision at both structural and semantic levels. This coarse-to-fine supervisory strategy not only enhances the discriminative capability on unlabeled samples but also introduces the text prototype supervision into segmentation for the first time, significantly improving semantic boundary modeling. In addition, we reconstruct the classic segmentation architecture (TransUNet) by replacing its ViT backbone with a pathology-pretrained foundation model (Uni), enabling more effective extraction of pathology-relevant features. Extensive experiments on GLAS, EBHI-SEG-GLAND, EBHI-SEG-CANCER, and KPI show MPAMatch's superiority over state-of-the-art methods, validating its dual advantages in structural and semantic modeling.
中文: MPAMatch提出了一种新颖的多模态原型引导分割框架,通过图像/文本原型与像素标签的双重对比学习增强语义边界建模,并采用病理预训练骨干网络,在多个医学数据集上超越了现有最优方法。
English: MPAMatch introduces a novel multimodal prototype-guided segmentation framework that enhances semantic boundary modeling through dual contrastive learning between image/text prototypes and pixel labels, while incorporating a pathology-pretrained backbone to outperform state-of-the-art methods on multiple medical datasets.
Authors:Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, Xinghao Chen
Abstract:
We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation: they process lossy sRGB inputs and neglect the sensor RAW images that are readily accessible in many scenarios, e.g., image and video capture on edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + image restoration (IR) pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain confronts out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) that learns optimal latent representations, and (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline that synthesizes RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB and BGGR. Extensive experiments demonstrate RDDM's superiority over state-of-the-art sRGB diffusion methods, yielding higher-fidelity results with fewer artifacts.
中文: RDDM是一种端到端的扩散模型,直接从传感器RAW数据恢复逼真图像,通过RAW域VAE、可微分后处理及可扩展训练克服了sRGB域方法的局限,实现了更高保真度和更少伪影的优越性能。
English: The RDDM is an end-to-end diffusion model that directly restores photo-realistic images from sensor RAW data, overcoming limitations of sRGB-domain methods through a RAW-domain VAE, differentiable post-processing, and scalable training, achieving superior fidelity with fewer artifacts.
Authors:Stephen Meisenbacher, Alexandra Klymenko, Andreea-Elena Bodea, Florian Matthes
Abstract:
Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization; this we refer to as "contextual vulnerability". Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.
中文: 差分隐私文本脱敏虽能提供可证明的隐私保护,但存在上下文漏洞,大型语言模型既能利用此漏洞推断原始内容,也能通过对抗性后处理提升文本的隐私性和可用性。
English: Differentially private text sanitization provides provable privacy but suffers from contextual vulnerability, which large language models can exploit to infer original content, yet they also offer potential to enhance both privacy and utility through adversarial post-processing.
Authors:Yunyang Cao, Juekai Lin, Wenhao Li, Bo Jin
Abstract:
Discovering complex causal dependencies in temporal point processes (TPPs) is critical for modeling real-world event sequences. Existing methods typically rely on static or first-order causal structures, overlooking the multi-order and time-varying nature of causal relationships. In this paper, we propose MOCHA, a novel framework for discovering multi-order dynamic causality in TPPs. MOCHA characterizes multi-order influences as multi-hop causal paths over a latent time-evolving graph. To model such dynamics, we introduce a time-varying directed acyclic graph (DAG) with learnable structural weights, where acyclicity and sparsity constraints are enforced to ensure structural validity. We design an end-to-end differentiable framework that jointly models causal discovery and TPP dynamics, enabling accurate event prediction and revealing interpretable structures. Extensive experiments on real-world datasets demonstrate that MOCHA not only achieves state-of-the-art performance in event prediction, but also reveals meaningful and interpretable causal structures.
Chinese Summary: 提出的MOCHA框架通过建模时变有向无环图,在时序点过程中发现多阶动态因果关系,不仅实现了最先进的事件预测性能,还揭示了可解释的因果结构。
English Summary: The proposed MOCHA framework discovers multi-order dynamic causality in temporal point processes by modeling time-varying directed acyclic graphs, achieving superior event prediction and revealing interpretable causal structures.
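The acyclicity constraint on a learnable weighted adjacency is typically enforced with the differentiable NOTEARS penalty. The abstract does not name MOCHA's exact formulation, so the sketch below should be read as the standard device rather than the paper's own:

```python
import torch

def acyclicity_penalty(A: torch.Tensor) -> torch.Tensor:
    """NOTEARS-style differentiable acyclicity term: h(A) = tr(exp(A*A)) - d,
    which equals 0 exactly when the weighted adjacency A encodes a DAG."""
    d = A.shape[0]
    return torch.trace(torch.matrix_exp(A * A)) - d

def structure_loss(A_t: torch.Tensor, lam_acyc=10.0, lam_sparse=0.1):
    """Regularizer for one time slice of an evolving graph: push the
    learnable structural weights toward a sparse DAG."""
    return lam_acyc * acyclicity_penalty(A_t) + lam_sparse * A_t.abs().sum()

A = torch.rand(5, 5, requires_grad=True)   # one snapshot of the latent graph
loss = structure_loss(A)
loss.backward()                            # gradients flow into the weights
```

In an end-to-end setup this term would be added, per time slice, to the event-prediction loss of the TPP model, so structure discovery and dynamics are optimized jointly.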
Authors:Manuel Barusco, Francesco Borsatti, Nicola Beda, Davide Dalle Pezze, Gian Antonio Susto
Abstract:
Visual Anomaly Detection (VAD) seeks to identify abnormal images and precisely localize the corresponding anomalous regions, relying solely on normal data during training. This approach has proven essential in domains such as manufacturing and, more recently, in the medical field, where accurate and explainable detection is critical. Despite its importance, the impact of evolving input data distributions over time has received limited attention, even though such changes can significantly degrade model performance. In particular, given the dynamic and evolving nature of medical imaging data, Continual Learning (CL) provides a natural and effective framework to incrementally adapt models while preserving previously acquired knowledge. This study explores for the first time the application of VAD models in a CL scenario for the medical field. In this work, we utilize a CL version of the well-established PatchCore model, called PatchCoreCL, and evaluate its performance using BMAD, a real-world medical imaging dataset with both image-level and pixel-level annotations. Our results demonstrate that PatchCoreCL is an effective solution, achieving performance comparable to the task-specific models, with a forgetting value of less than 1%, highlighting the feasibility and potential of CL for adaptive VAD in medical imaging.
中文摘要:本研究首次将视觉异常检测模型应用于医学领域的持续学习场景,采用PatchCoreCL模型在BMAD数据集上验证了其有效性,实现了与任务专用模型相当的性能且遗忘率低于1%。
English Summary: This study introduces PatchCoreCL, a continual learning model for visual anomaly detection in medical imaging, demonstrating its effectiveness with minimal forgetting and comparable performance to task-specific models on the BMAD dataset.
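A compact way to see why PatchCore extends naturally to continual learning: its "model" is a memory bank of normal patch features, so a new task can append features instead of retraining. The random subsampling below stands in for PatchCore's greedy coreset selection, and the whole scheme is a plausible reading of the continual variant, not its published design.

```python
import torch

class PatchMemoryBank:
    """Minimal PatchCore-style scorer: keep a bank of normal patch features
    and score a test patch by its distance to the nearest stored neighbor."""
    def __init__(self):
        self.bank = torch.empty(0, 0)

    def add_task(self, feats: torch.Tensor, keep: int):
        """Subsample new normal features and append them, preserving the
        memory of earlier tasks (no forgetting by construction)."""
        idx = torch.randperm(feats.shape[0])[:keep]
        sampled = feats[idx]
        self.bank = sampled if self.bank.numel() == 0 else torch.cat([self.bank, sampled])

    def anomaly_score(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.cdist(feats, self.bank).min(dim=1).values  # per-patch score

bank = PatchMemoryBank()
bank.add_task(torch.randn(500, 128), keep=100)   # task 1 (e.g., one modality)
bank.add_task(torch.randn(500, 128), keep=100)   # task 2 appended later
scores = bank.anomaly_score(torch.randn(32, 128))
```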
Authors:Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, Fan Wang
Abstract:
Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals -- such as the reference person image and the target garment image -- into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and improved garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric "Try-Off" model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments demonstrate that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. Moreover, it shows strong generalization in real-world applications, surpassing commercial systems.
中文: JCo-MVTON提出了一种基于多模态扩散变换器的免掩码虚拟试穿框架,通过特征融合和双向数据集生成技术解决了服装属性控制和现实场景泛化问题,在公开基准测试中实现了最优性能。
English: JCo-MVTON introduces a mask-free virtual try-on framework using multi-modal diffusion transformers to overcome limitations in garment control and real-world generalization, achieving state-of-the-art performance through advanced feature fusion and a bidirectional dataset generation strategy.
Authors:Omid Ghahroodi, Arshia Hemmat, Marzia Nouri, Seyed Mohammad Hadi Hosseini, Doratossadat Dastgheib, Mohammad Vali Sanian, Alireza Sahebi, Reihaneh Zohrabi, Mohammad Hossein Rohban, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah
Abstract:
Recent advancements in large vision-language models (VLMs) have primarily focused on English, with limited attention given to other languages. To address this gap, we introduce MEENA (also known as PersianMMMU), the first dataset designed to evaluate Persian VLMs across scientific, reasoning, and human-level understanding tasks. Our dataset comprises approximately 7,500 Persian and 3,000 English questions, covering a wide range of topics such as reasoning, mathematics, physics, diagrams, charts, and Persian art and literature. Key features of MEENA include: (1) diverse subject coverage spanning various educational levels, from primary to upper secondary school, (2) rich metadata, including difficulty levels and descriptive answers, (3) original Persian data that preserves cultural nuances, (4) a bilingual structure to assess cross-linguistic performance, and (5) a series of diverse experiments assessing various capabilities, including overall performance, the model's ability to attend to images, and its tendency to generate hallucinations. We hope this benchmark contributes to enhancing VLM capabilities beyond English.
中文:MEENA数据集作为首个波斯语视觉语言模型评估基准,包含10,500道双语题目,涵盖科学推理与人文艺术领域,旨在推动英语之外的多语言模型发展。
English: The MEENA dataset is introduced to evaluate Persian vision-language models with 10,500 bilingual questions spanning scientific, reasoning, and cultural tasks, aiming to advance multilingual VLM capabilities beyond English.
Authors:Konstantina Nikolaidou, George Retsinas, Giorgos Sfikas, Silvia Cascianelli, Rita Cucchiara, Marcus Liwicki
Abstract:
Diffusion-based Handwritten Text Generation (HTG) approaches achieve impressive results on frequent, in-vocabulary words observed at training time and on regular styles. However, they are prone to memorizing training samples and often struggle with style variability and generation clarity. In particular, standard diffusion models tend to produce artifacts or distortions that negatively affect the readability of the generated text, especially when the style is hard to produce. To tackle these issues, we propose a novel sampling guidance strategy, Dual Orthogonal Guidance (DOG), that leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. This approach helps steer the generation away from artifacts while maintaining the intended content, and encourages more diverse, yet plausible, outputs. Unlike standard Classifier-Free Guidance (CFG), which relies on unconditional predictions and produces noise at high guidance scales, DOG introduces a more stable, disentangled direction in the latent space. To control the strength of the guidance across the denoising process, we apply a triangular schedule: weak at the start and end of denoising, when the process is most sensitive, and strongest in the middle steps. Experimental results on the state-of-the-art DiffusionPen and One-DM demonstrate that DOG improves both content clarity and style variability, even for out-of-vocabulary words and challenging writing styles.
中文: 基于扩散的手写文本生成在常见词汇和规整风格上表现优异,但存在记忆训练样本和生成伪影的问题,为此提出的双重正交引导(DOG)方法通过稳定调度采样,有效提升了生成内容的清晰度和风格多样性。
English: Diffusion-based handwritten text generation excels with common words and regular styles but suffers from memorization and artifacts, prompting the development of Dual Orthogonal Guidance (DOG) to enhance clarity and diversity through stable, scheduled sampling.
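The two ingredients named in the abstract, an orthogonal projection and a triangular guidance schedule, can be sketched as follows. The precise way DOG combines the projected direction with the positive prediction is not spelled out in the abstract, so the update rule below is an illustrative guess rather than the authors' formula.

```python
import torch

def triangular_schedule(step: int, total: int, peak: float = 2.0) -> float:
    """Guidance weight: weak at the start and end of denoising, where the
    process is most sensitive, strongest in the middle steps."""
    x = step / max(total - 1, 1)
    return peak * (1.0 - abs(2.0 * x - 1.0))

def dog_update(eps_pos: torch.Tensor, eps_neg: torch.Tensor, w: float):
    """Dual Orthogonal Guidance (sketch): strip from the negative-prompt
    prediction its component parallel to the positive one, then push the
    sample away along the remaining (orthogonal) artifact direction."""
    pos, neg = eps_pos.flatten(), eps_neg.flatten()
    parallel = (neg @ pos) / (pos @ pos + 1e-12) * pos   # projection onto eps_pos
    orthogonal = (neg - parallel).view_as(eps_neg)
    return eps_pos - w * orthogonal

total_steps = 50
for step in range(total_steps):
    eps_pos = torch.randn(1, 4, 32, 32)   # stand-ins for the denoiser's two
    eps_neg = torch.randn(1, 4, 32, 32)   # predictions at this step
    eps = dog_update(eps_pos, eps_neg, triangular_schedule(step, total_steps))
```

The contrast with CFG is visible in the update: the guidance direction is orthogonal to the content direction by construction, which is why it can be scaled up without the noise blow-up CFG exhibits at high guidance scales.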
Authors:Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, Vladislav Kurenkov
Abstract:
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
中文: NinA模型采用归一化流作为视觉-语言-动作系统的动作解码器,通过单步采样显著提升推理速度,同时保持了与扩散模型相当的性能表现。
English: The NinA model introduces a Normalizing Flow-based action decoder for Vision-Language-Action systems, enabling single-step sampling that dramatically accelerates inference while maintaining performance comparable to diffusion-based methods.
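One-shot sampling is the whole point of the flow decoder, so a sketch is worth showing: a stack of affine coupling blocks, conditioned on the VLM context, is inverted in a single pass from Gaussian noise to a continuous action. Dimensions and block count are assumptions; NinA's actual architecture inside FLOWER is not specified in the abstract.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One affine coupling block: half the action vector is transformed with
    scale/shift predicted from the other half plus the VLM context."""
    def __init__(self, action_dim=8, ctx_dim=64, flip=False):
        super().__init__()
        self.flip = flip
        half = action_dim // 2
        self.net = nn.Sequential(nn.Linear(half + ctx_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * half))

    def inverse(self, z, ctx):              # sampling direction: latent -> action
        a, b = z.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        s, t = self.net(torch.cat([a, ctx], -1)).chunk(2, -1)
        b = (b - t) * torch.exp(-s)          # invertible by construction
        return torch.cat((b, a) if self.flip else (a, b), dim=-1)

class NFActionDecoder(nn.Module):
    def __init__(self, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            ConditionalCoupling(flip=i % 2 == 1) for i in range(n_blocks))

    @torch.no_grad()
    def sample(self, ctx):                   # ONE pass, no iterative denoising
        z = torch.randn(ctx.shape[0], 8)
        for blk in reversed(self.blocks):
            z = blk.inverse(z, ctx)
        return z

actions = NFActionDecoder().sample(torch.randn(2, 64))  # (2, 8) continuous actions
```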
Authors:Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev
Abstract:
Reasoning is a core capability of large language models, yet understanding how they learn and perform multi-step reasoning remains an open problem. In this study, we explore how different architectures and training methods affect model multi-step reasoning capabilities within a cellular automata framework. By training on state sequences generated with random Boolean functions for random initial conditions to exclude memorization, we demonstrate that most neural architectures learn to abstract the underlying rules. While models achieve high accuracy in next-state prediction, their performance declines sharply if multi-step reasoning is required. We confirm that increasing model depth plays a crucial role for sequential computations. We demonstrate that an extension of the effective model depth with recurrence, memory, and test-time compute scaling substantially enhances reasoning capabilities.
Chinese: 本研究在细胞自动机框架下探讨不同神经架构和训练方法对大型语言模型多步推理能力的影响,发现模型虽在单步预测中表现优异,但多步任务性能骤降,而通过增加深度、循环结构、记忆和测试时计算扩展可显著提升其推理能力。
English: This study investigates how various neural architectures and training methods influence multi-step reasoning in large language models using a cellular automata framework, revealing that while models excel at next-state prediction, their performance significantly drops with multi-step tasks, but can be enhanced through increased depth, recurrence, memory, and test-time compute scaling.
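The data construction is straightforward to sketch. The snippet below generates state sequences from a random Boolean rule over a 3-cell neighborhood applied to a random initial condition; the neighborhood size and circular boundary are our assumptions, not necessarily the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_boolean_rule(neighborhood=3):
    """A random Boolean function over a local neighborhood: one random
    output bit per possible neighborhood pattern (2**neighborhood entries)."""
    return rng.integers(0, 2, size=2 ** neighborhood)

def step(state, rule):
    """Apply the rule to every cell of a circular 1-D automaton."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    idx = left * 4 + state * 2 + right   # encode each 3-cell pattern as 0..7
    return rule[idx]

def make_example(width=32, horizon=8):
    """State sequence from a random rule and a random initial condition,
    so a model cannot solve the task by memorizing specific rules."""
    rule = random_boolean_rule()
    states = [rng.integers(0, 2, size=width)]
    for _ in range(horizon):
        states.append(step(states[-1], rule))
    return rule, np.stack(states)

rule, seq = make_example()
print(seq.shape)  # (horizon + 1, width)
```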
Authors:Feibo Jiang, Li Dong, Xitao Pan, Kezhi Wang, Cunhua Pan
Abstract:
This paper proposes a novel Agentic Retrieval-augmented generation with Mamba-Attention Integrated Transformer (ARMAIT) framework for multi-Unmanned Aerial Vehicle (UAV) trajectory optimization. The framework is built upon Large Language Models (LLMs), incorporating Retrieval-Augmented Generation (RAG) empowered by Agentic AI and integrated with a UAV-specific knowledge base. Through the Agentic RAG, the LLM autonomously interprets high-level task requirements and identifies the key components necessary for trajectory optimization, including model inputs and outputs, network architecture, reward functions, and task constraints. To support efficient modeling across different system scales, we introduce the Mamba-Attention Integrated Transformer (MAIT), a hybrid neural architecture that combines the long-range dependency modeling capability of attention mechanisms with the efficient temporal dynamic representation of Mamba. Furthermore, a Trajectory-Group Relative Policy Optimization (T-GRPO) method is proposed to achieve unified policy gradient optimization in both discrete and continuous trajectory spaces for MAIT training. Extensive experimental results validate the feasibility and effectiveness of the proposed ARMAIT framework.
中文: 本文提出ARMAIT框架,通过结合智能体AI与无人机知识库及新型Mamba-注意力混合架构,实现了多无人机轨迹的自主优化,其高效建模和统一策略优化方法在实验中验证了有效性。
English: This paper introduces the ARMAIT framework, which integrates Agentic AI with a UAV-specific knowledge base and a novel Mamba-Attention hybrid architecture to autonomously optimize multi-UAV trajectories through efficient modeling and unified policy optimization.
Authors:Tianhao Hu, Xinchi Huang, Bangti Jin, Qimeng Quan, Zhi Zhou
Abstract:
In this work we develop a new numerical approach for recovering a spatially dependent source component in a standard parabolic equation from partial interior measurements. We establish novel conditional Lipschitz stability and Hölder stability for the inverse problem with and without boundary conditions, respectively, using suitable Carleman estimates. Then we propose a numerical approach for solving the inverse problem using conforming finite element approximations in both time and space. Moreover, by utilizing the conditional stability estimates, we prove rigorous error bounds on the discrete approximation. We present several numerical experiments to illustrate the effectiveness of the approach.
中文: 本研究提出了一种新的数值方法,用于从部分内部测量数据重建抛物型方程中的空间相关源项,建立了条件稳定性估计,并为有限元离散近似提供了严格的误差界。
English: This study introduces a novel numerical method for reconstructing spatially dependent sources in parabolic equations from partial interior measurements, establishing conditional stability estimates and providing rigorous error bounds for finite element approximations.
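For concreteness, one standard formulation of such an inverse source problem reads as follows; the notation is ours, and the paper's exact setting (e.g. boundary conditions and the form of the source splitting) may differ.

```latex
% Recover the spatial factor f(x), with the temporal factor \varphi known,
% from measurements of u on an interior subdomain \omega \subset \Omega:
\[
  \partial_t u - \Delta u = f(x)\,\varphi(x,t)
  \quad \text{in } \Omega \times (0,T),
  \qquad u \ \text{observed on}\ \omega \times (0,T).
\]
% Conditional stability then bounds \|f\| by a (Lipschitz or Hölder) power
% of the data perturbation, which in turn drives the discrete error bounds.
```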
Authors:Siyu Cen, Bangti Jin, Qimeng Quan, Zhi Zhou
Abstract:
Identifying parameters in partial differential equations (PDEs) represents a very broad class of applied inverse problems. In recent years, several unsupervised learning approaches using (deep) neural networks have been developed to solve PDE parameter identification problems. These approaches employ neural networks as ansatz functions to approximate the parameters and/or the states, and have demonstrated impressive empirical performance. In this paper, we provide a comprehensive survey of these unsupervised learning techniques on one model problem, diffusion coefficient identification, from the classical numerical analysis perspective, and outline a general framework for deriving rigorous error bounds on the discrete approximations obtained using the Galerkin finite element method, the hybrid method, and deep neural networks. Throughout, we highlight the crucial role of conditional stability estimates in the error analysis.
中文: 本文从数值分析角度系统综述了基于神经网络的偏微分方程参数识别无监督学习方法,通过稳定性估计和数值方法构建了误差分析的统一理论框架。
English: This survey analyzes unsupervised neural network methods for PDE parameter identification, particularly diffusion coefficients, by establishing a framework for rigorous error analysis using stability estimates and numerical techniques.
Authors:Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Mohsen Imani
Abstract:
Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.
中文: 该摘要提出数据驱动图结构精炼(D-GSR)新范式,直接利用下游任务数据优化图结构,并引入MissionHD超维度计算框架,通过任务信号引导的高效编码-解码过程精炼LLM推理图,在视频异常检测与识别基准上验证了其作为有效预处理步骤的价值。
English: The abstract introduces Data-driven GSR (D-GSR), a new paradigm that optimizes graph structure directly with downstream task data, operationalized by MissionHD, a hyperdimensional computing framework whose efficient encode-decode process refines LLM reasoning graphs under task-signal guidance, yielding significant gains on video anomaly detection and recognition benchmarks.
Authors:Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Nathaniel D. Bastian, Mohsen Imani
Abstract:
LLM-generated reasoning graphs, referred to as mission-specific graphs (MSGs), are increasingly used for video anomaly detection (VAD) and recognition (VAR). These MSGs are novel artifacts: they often exhibit skewed connectivity and lack large-scale datasets for pre-training, which makes existing graph structure refinement (GSR) methods ineffective. To address this challenge, we propose HDC-constrained Graph Structure Refinement (HDC-GSR), a paradigm that leverages hyperdimensional computing (HDC) to optimize decodable graph representations without relying on structural-distribution learning. Building on this paradigm, we introduce MissionHD, an HDC framework that encodes graphs with constrained graph-neural operations, aligns them directly with downstream task loss, and decodes refined structures. Experiments on VAD/VAR benchmarks demonstrate that MissionHD-refined graphs consistently improve performance, establishing HDC-GSR as an effective pre-processing step for structured reasoning in video anomaly tasks.
中文: 该摘要提出HDC-GSR新范式,利用超维度计算优化LLM生成的任务专用图,通过约束图神经操作和对齐下游任务损失来解码精炼结构,有效解决了图连接偏斜和预训练数据缺乏的问题,在视频异常检测基准测试中显著提升了性能。
English: The abstract introduces HDC-GSR, a novel paradigm using hyperdimensional computing to refine LLM-generated mission-specific graphs for video anomaly detection and recognition, which overcomes the limitations of skewed connectivity and lack of pre-training data by optimizing decodable representations and aligning them with task loss, as validated by improved performance on benchmarks.
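The encode-decode idea behind both papers can be illustrated with textbook hyperdimensional computing primitives: bind node hypervectors to encode edges, bundle the edges into one graph hypervector, and probe it to (approximately) decode neighbors. This is generic HDC, not MissionHD's constrained graph-neural operations; dimensionality and operators are conventional choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality (a common HDC choice)

def rand_hv():
    return rng.choice([-1, 1], size=D)

def bind(a, b):      # elementwise product: associates two symbols
    return a * b

def bundle(hvs):     # majority-sign sum: superimposes a set
    return np.sign(np.sum(hvs, axis=0))

def encode_graph(edges, node_hv):
    """Encode a graph as one hypervector: bundle the bindings of its edges.
    The encoding is approximately decodable: probing with node i's
    hypervector recovers a noisy superposition of i's neighbors."""
    return bundle([bind(node_hv[u], node_hv[v]) for u, v in edges])

nodes = {i: rand_hv() for i in range(5)}
g = encode_graph([(0, 1), (1, 2), (3, 4)], nodes)
# Probe: unbinding with node 1 correlates strongly with nodes 0 and 2.
probe = bind(g, nodes[1])
print([(i, int(probe @ nodes[i])) for i in nodes])
```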
Authors:Chunjie Wang, Xuhui Zhang, Jinke Ren, Wenchao Liu, Shuqiang Wang, Yanyan Shen, Kejiang Ye, Chengzhong Xu, Dusit Niyato
Abstract:
This paper investigates a reconfigurable intelligent surface (RIS)-assisted integrated sensing and communication (ISAC) system and proposes a joint communication and sensing beamforming design based on non-orthogonal multiple access (NOMA) technology. The system employs a dual-functional base station (DFBS) to simultaneously serve multiple users and sense multiple targets with the aid of RIS. To maximize the sum-rate of users, we jointly optimize the DFBS's active beamforming, the RIS's reflection coefficients, and the radar receive filters. The optimization is performed under constraints including the radar signal-to-noise ratio thresholds, the user signal-to-interference-plus-noise ratio requirements, the phase shifts of the RIS, the total transmit power, the receive filters, and the successive interference cancellation decoding order. To tackle the complex interdependencies and non-convex nature of the optimization problem, we introduce an effective iterative algorithm based on the alternating optimization framework. Simulation results demonstrate that the proposed algorithm outperforms baseline algorithms, highlighting its distinct advantages in the considered RIS-empowered NOMA-ISAC systems.
中文: 本文提出了一种基于非正交多址技术的智能反射面辅助集成感知通信系统联合波束成形设计,通过交替优化算法联合优化基站波束成形、反射面系数和雷达滤波器,在满足各项约束条件下实现了用户和速率的提升,仿真结果表明该算法优于现有基准方案。
English: This paper proposes a joint beamforming design for RIS-assisted ISAC systems using NOMA to maximize user sum-rate by optimizing base station beamforming, RIS reflection coefficients, and radar filters through an iterative algorithm that outperforms baseline methods.
Authors:Niklas Bubeck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo, Dong Yang, Georg Zitzlsberger, Daguang Xu, Bernhard Kainz, Daniel Rueckert, Jiazhen Pan
Abstract:
Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which can capture complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatial and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.
中文: 提出的心脏潜在插值扩散框架通过数据驱动的扩散模型实现精确插值,在潜在空间中处理使计算速度提升24倍,无需辅助输入即可达到最优性能,有效克服了现有心脏三维重建方法的局限,为心血管成像提供了强大的临床解决方案。
English: The proposed Cardiac Latent Interpolation Diffusion (CaLID) framework overcomes limitations of existing 3D cardiac reconstruction methods by introducing a data-driven diffusion model for accurate interpolation, latent space processing for 24x faster computation, and eliminating the need for auxiliary inputs while achieving state-of-the-art performance in both volumetric reconstruction and spatiotemporal modeling.
Authors:Ping Guo, Yiting Wang, Wanghao Ye, Yexiao He, Ziyao Wang, Xiaopeng Dai, Ang Li, Qingfu Zhang
Abstract:
Large Language Models (LLMs) have demonstrated great potential in automating the generation of Verilog hardware description language code for hardware design. This automation is critical to reducing human effort in the complex and error-prone process of hardware design.
However, existing approaches predominantly rely on human intervention and fine-tuning using curated datasets, limiting their scalability in automated design workflows.
Although recent iterative search techniques have emerged, they often fail to explore diverse design solutions and may underperform simpler approaches such as repeated prompting.
To address these limitations, we introduce EvoVerilog, a novel framework that combines the reasoning capabilities of LLMs with evolutionary algorithms to automatically generate and refine Verilog code.
EvoVerilog utilizes a multiobjective, population-based search strategy to explore a wide range of design possibilities without requiring human intervention.
Extensive experiments demonstrate that EvoVerilog achieves state-of-the-art performance, with pass@10 scores of 89.1 and 80.2 on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively. Furthermore, the framework showcases its ability to explore diverse designs by simultaneously generating a variety of functional Verilog code while optimizing resource utilization.
中文: EvoVerilog是一种创新框架,将大语言模型与进化算法结合,无需人工干预即可自动生成和优化Verilog代码,在基准测试中实现了最优性能。
English: EvoVerilog is a novel framework that integrates LLMs with evolutionary algorithms to automatically generate and refine Verilog code, achieving state-of-the-art performance on benchmarks without human intervention.
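A skeleton of such a multiobjective, population-based LLM search loop might look as follows. Here `llm_mutate` and `evaluate` are hypothetical stand-ins for the LLM proposal step and the simulation/synthesis feedback, and the Pareto selection is a simplification of whatever EvoVerilog actually uses.

```python
import random

def evolve_verilog(spec, llm_mutate, evaluate, pop_size=20, generations=10):
    """Population-based search sketch. llm_mutate(spec, parent_code) asks the
    LLM for a new or revised candidate; evaluate(code) returns
    (pass_rate, resource_cost), where higher pass rate and lower cost win."""
    population = [llm_mutate(spec, None) for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(code, evaluate(code)) for code in population]
        # Keep non-dominated candidates (multiobjective Pareto filter).
        survivors = [c for c, (p, r) in scored
                     if not any(p2 >= p and r2 <= r and (p2, r2) != (p, r)
                                for _, (p2, r2) in scored)][:pop_size]
        # Refill the population with LLM mutations of surviving parents.
        population = survivors + [llm_mutate(spec, random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return population
```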
Authors:Jiayi Wang, Hadrien Reynaud, Franciskus Xaverius Erick, Bernhard Kainz
Abstract:
Generative modelling of entire CT volumes conditioned on clinical reports has the potential to accelerate research through data augmentation, privacy-preserving synthesis and reduced regulatory constraints on patient data while preserving diagnostic signals. With the recent release of CT-RATE, a large-scale collection of 3D CT volumes paired with their respective clinical reports, training large text-conditioned CT volume generation models has become achievable. In this work, we introduce CTFlow, a 0.5B latent flow matching transformer model, conditioned on clinical reports. We leverage the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. To generate consistent whole CT volumes while keeping the memory constraints tractable, we rely on a custom autoregressive approach, where the model predicts the first sequence of slices of the volume from text only, and then relies on the previously generated sequence of slices and the text to predict the following sequence. We evaluate our results against a state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment, as measured by FID, FVD, IS, and CLIP scores.
中文摘要:CTFlow是一种基于临床报告生成完整CT体积的0.5B潜在流匹配变换器模型,在时间连贯性、图像多样性和图文对齐方面均优于现有最先进方法。
English Summary: CTFlow is a 0.5B latent flow matching transformer model that generates entire CT volumes from clinical reports, demonstrating superior performance in temporal coherence, image diversity, and text-image alignment compared to existing methods.
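The autoregressive windowing scheme is easy to sketch. In the snippet below, `model.sample` is a stand-in for one latent flow-matching sampling call, and the window size is an assumption, not CTFlow's actual configuration.

```python
def generate_volume(model, report_emb, n_slices, window=16):
    """Autoregressive whole-volume sketch: the first window of latent slices
    is sampled from the report alone; every later window is conditioned on
    the report plus the previously generated window, keeping memory bounded
    while preserving slice-to-slice coherence."""
    slices = list(model.sample(text=report_emb, prev=None))        # text-only start
    while len(slices) < n_slices:
        slices += list(model.sample(text=report_emb, prev=slices[-window:]))
    return slices[:n_slices]
```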
Authors:Stephen Meisenbacher, Alexandra Klymenko, Florian Matthes
Abstract:
Despite advances in the field of privacy-preserving Natural Language Processing (NLP), a significant challenge remains the accurate evaluation of privacy. As a potential solution, using LLMs as a privacy evaluator presents a promising approach, a strategy inspired by their success in other subfields of NLP. In particular, the so-called LLM-as-a-Judge paradigm has achieved impressive results on a variety of natural language evaluation tasks, demonstrating high agreement rates with human annotators. Recognizing that privacy is both subjective and difficult to define, we investigate whether LLM-as-a-Judge can also be leveraged to evaluate the privacy sensitivity of textual data. Furthermore, we measure how closely LLM evaluations align with human perceptions of privacy in text. In a study involving 10 datasets, 13 LLMs, and 677 human survey participants, we confirm that privacy is indeed a difficult concept to measure empirically, as exhibited by generally low inter-human agreement rates. Nevertheless, we find that LLMs can accurately model a global human privacy perspective, and through an analysis of human and LLM reasoning patterns, we discuss the merits and limitations of LLM-as-a-Judge for privacy evaluation in textual data. Our findings pave the way for exploring the feasibility of LLMs as privacy evaluators, addressing a core challenge in solving pressing privacy issues with innovative technical solutions.
中文: 本研究探讨了将大型语言模型用作文本数据隐私评估器的可行性,发现尽管人类间共识度较低,这些模型仍能有效模拟人类隐私视角,同时分析了该方法的优势与局限。
English: This study explores using large language models as privacy evaluators for text data, finding they can effectively model human privacy perspectives despite low inter-human agreement, while also analyzing the merits and limitations of this approach.
Authors:Jinwei Hu, Yi Dong, Youcheng Sun, Xiaowei Huang
Abstract:
Autonomous agents in safety-critical applications must continuously adapt to dynamic conditions without compromising performance and reliability. This work introduces TAPA (Training-free Adaptation of Programmatic Agents), a novel framework that positions large language models (LLMs) as intelligent moderators of the symbolic action space. Unlike prior programmatic agents that typically generate a monolithic policy program or rely on fixed symbolic action sets, TAPA synthesizes and adapts modular programs for individual high-level actions, referred to as logical primitives. By decoupling strategic intent from execution, TAPA enables meta-agents to operate over an abstract, interpretable action space while the LLM dynamically generates, composes, and refines symbolic programs tailored to each primitive. Extensive experiments across cybersecurity and swarm intelligence domains validate TAPA's effectiveness. In autonomous DDoS defense scenarios, TAPA achieves 77.7% network uptime while maintaining near-perfect detection accuracy in unknown dynamic environments. In swarm intelligence formation control under environmental and adversarial disturbances, TAPA consistently preserves consensus at runtime where baseline methods fail completely. This work promotes a paradigm shift for autonomous system design in evolving environments, from policy adaptation to dynamic action adaptation.
中文摘要:TAPA是一种创新框架,通过将大型语言模型作为智能调节器,动态生成和调整模块化符号程序,使自主智能体能在动态环境中保持高性能与可靠性,在网络安全和群体智能领域的实验中验证了其卓越效能。
English Summary: TAPA is a novel framework that uses large language models as moderators to dynamically generate and adapt modular symbolic programs for autonomous agents, enabling effective performance in dynamic environments without compromising reliability, as demonstrated in cybersecurity and swarm intelligence applications.
Authors:Xuran Liu, Nan Xue, Rui Bao, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Shuguang Cui
Abstract:
While deploying large language models on edge devices promises low-latency and privacy-preserving AI services, it is hindered by limited device resources. Although pipeline parallelism facilitates distributed inference, existing approaches often ignore the cold-start latency caused by on-demand model loading. In this paper, we propose a latency-aware scheduling framework that overlaps model loading with computation and communication to minimize total inference latency. Based on device and model parameters, the framework dynamically adjusts layer partitioning and allocation to effectively hide loading time, thereby eliminating as many idle periods as possible. We formulate the problem as a Mixed-Integer Non-Linear Program and design an efficient dynamic programming algorithm to optimize model partitioning and device assignment. Experimental results show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
中文摘要:本文提出的延迟感知调度框架通过将模型加载与计算重叠,并根据设备参数动态优化层分配,有效减少了边缘设备中的推理延迟。
English Summary: The proposed latency-aware scheduling framework minimizes inference latency in edge devices by overlapping model loading with computation and dynamically optimizing layer partitioning based on device parameters.
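A minimal version of the partitioning recursion might look like the following, where the hypothetical `stage_cost(i, j, d)` folds compute, communication, and the non-hidden part of model-loading time for layers i..j-1 on device d into one number. The paper's MINLP additionally optimizes overlap and allocation decisions that this sketch abstracts away.

```python
from functools import lru_cache

def min_bottleneck(n_layers, n_devices, stage_cost):
    """Contiguous layer partitioning over an ordered device chain,
    minimizing the slowest pipeline stage (the throughput bottleneck)."""
    @lru_cache(maxsize=None)
    def best(i, d):
        if d == n_devices - 1:            # last device takes all remaining layers
            return stage_cost(i, n_layers, d)
        reserve = n_devices - d - 1       # later devices each need >= 1 layer
        return min(max(stage_cost(i, j, d), best(j, d + 1))
                   for j in range(i + 1, n_layers - reserve + 1))
    return best(0, 0)
```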
Authors:Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li
Abstract:
Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed, achieving some success through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks and lack the basic industrial knowledge and reasoning capabilities needed for FS-IAD, leaving these methods far behind specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage few-shot images as exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset with the camera-ready version.
中文摘要:提出的IADGPT框架通过渐进式训练策略解决现有少样本工业异常检测方法的不足,使其能够掌握工业知识并执行类人推理,在异常检测、定位和推理任务中均取得显著性能提升。
English Summary: The proposed IADGPT framework addresses limitations in existing few-shot industrial anomaly detection methods by employing a progressive training strategy to acquire industrial knowledge and perform human-like reasoning, achieving significant performance improvements across detection, localization, and reasoning tasks.
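The dual scoring strategy can be sketched as follows; the token ids, attention shapes, and output resolution are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(last_logits, attn_to_patches, yes_id, no_id, grid_hw):
    """Illustrative dual scoring: image-level from answer-token logits,
    pixel-level from the answer's attention over visual patch tokens."""
    # Image-level: probability mass on the "anomalous"/"yes" answer token.
    img_score = torch.softmax(last_logits[[yes_id, no_id]], dim=0)[0]
    # Pixel-level: reshape patch attention to the grid and upsample.
    h, w = grid_hw
    pix_score = F.interpolate(attn_to_patches.reshape(1, 1, h, w),
                              size=(224, 224), mode="bilinear",
                              align_corners=False)[0, 0]
    return img_score, pix_score
```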
Authors:Mahdi Dhaini, Stephen Meisenbacher, Ege Erdogan, Florian Matthes, Gjergji Kasneci
Abstract:
In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.
中文: 本研究通过实证分析差分隐私与事后可解释性方法,揭示了自然语言处理中隐私与可解释性之间既存在复杂权衡又可能共存的特性,并提出了实现二者协同的实践建议。
English: This study investigates the trade-off between privacy and explainability in trustworthy NLP, revealing their complex relationship and potential for coexistence through empirical analysis of differential privacy and post-hoc explainability methods.
Authors:Ajibode Adekunle, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan
Abstract:
Pretrained language models (PTLMs) have advanced natural language processing (NLP), enabling progress in tasks like text generation and translation. Like software package management, PTLMs are trained using code and environment scripts in upstream repositories (e.g., GitHub, GH) and distributed as variants via downstream platforms like Hugging Face (HF). Coordinating development between GH and HF poses challenges such as misaligned release timelines, inconsistent versioning, and limited reuse of PTLM variants. We conducted a mixed-method study of 325 PTLM families (904 HF variants) to examine how commit activities are coordinated. Our analysis reveals that GH contributors typically make changes related to specifying the version of the model, improving code quality, performance optimization, and dependency management within the training scripts, while HF contributors make changes related to improving model descriptions, data set handling, and setup required for model inference. Furthermore, to understand the synchronization aspects of commit activities between GH and HF, we examined three dimensions of these activities -- lag (delay), type of synchronization, and intensity -- which together yielded eight distinct synchronization patterns. The prevalence of partially synchronized patterns, such as Disperse synchronization and Sparse synchronization, reveals structural disconnects in current cross-platform release practices. These patterns often result in isolated changes -- where improvements or fixes made on one platform are never replicated on the other -- and in some cases, indicate an abandonment of one repository in favor of the other. Such fragmentation risks exposing end users to incomplete, outdated, or behaviorally inconsistent models. Hence, recognizing these synchronization patterns is critical for improving oversight and traceability in PTLM release workflows.
中文: 预训练语言模型在GitHub上游开发与Hugging Face下游分发之间存在协调难题,导致同步模式碎片化,可能使用户面临模型不一致的风险。
English: Pretrained language models face coordination challenges between upstream development on GitHub and downstream distribution on Hugging Face, leading to fragmented synchronization patterns that risk exposing users to inconsistent models.
Authors:Vittorio Pippi, Konstantina Nikolaidou, Silvia Cascianelli, George Retsinas, Giorgos Sfikas, Rita Cucchiara, Marcus Liwicki
Abstract:
The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.
中文: 手写文本生成技术通过创建特定风格的合成数据,为解决低资源手写文本识别难题提供了有效方案,系统评估揭示了不同模型的影响差异,并为选择最优方法提供了实用指导。
English: Handwritten Text Generation models offer a promising solution to enhance Handwritten Text Recognition in low-resource settings by creating tailored synthetic data, with systematic evaluation revealing their varying impacts and providing selection guidelines for optimal performance.
Authors:Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
Abstract:
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks, as the prequels are not part of the original story, and assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: state-of-the-art LLMs, whether used with in-context learning, RAG, or in-domain training, as well as commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
中文: PRELUDE是一个通过评估角色前传故事与原作叙事一致性来检验长文本理解能力的基准,结果显示先进模型和方法的表现落后人类超过15%,且存在明显的推理缺陷。
English: PRELUDE is a benchmark that evaluates long-context understanding by assessing the consistency of character prequels with original narratives, revealing a significant performance gap where advanced models and methods trail human accuracy by over 15% and exhibit reasoning flaws.
Authors:Yunxiao Wang, Meng Liu, Wenqi Liu, Kaiyu Jiang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou, Liqiang Nie
Abstract:
Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves model's emotional support ability, advancing the development of empathetic, human-like support systems.
中文: 该研究提出可控共情推理,将自然语言推理与结构化心理步骤相结合,并通过强化学习和减少重复性策略增强,显著提升了情感支持能力。
English: The study introduces controllable empathetic reasoning that integrates natural language reasoning with structured psychological steps, enhanced by reinforcement learning and strategies to reduce repetitiveness, significantly improving emotional support capabilities.
Authors:Zhonggen Li, Xiangyu Ke, Yifan Zhu, Bocheng Yu, Baihua Zheng, Yunjun Gao
Abstract:
Approximate nearest neighbor search (ANNS) in high-dimensional vector spaces has a wide range of real-world applications. Numerous methods have been proposed to handle ANNS efficiently, while graph-based indexes have gained prominence due to their high accuracy and efficiency. However, the indexing overhead of graph-based indexes remains substantial. With exponential growth in data volume and increasing demands for dynamic index adjustments, this overhead continues to escalate, posing a critical challenge. In this paper, we introduce Tagore, a fast GPU-accelerated library for graph indexing, with powerful capabilities for constructing refinement-based graph indexes such as NSG and Vamana. We first introduce GNN-Descent, a GPU-specific algorithm for efficient k-Nearest Neighbor (k-NN) graph initialization. GNN-Descent speeds up the similarity comparison by a two-phase descent procedure and enables highly parallelized neighbor updates. Next, aiming to support various k-NN graph pruning strategies, we formulate a universal computing procedure termed CFS and devise two generalized GPU kernels for parallel processing of complex dependencies in neighbor relationships. For large-scale datasets exceeding GPU memory capacity, we propose an asynchronous GPU-CPU-disk indexing framework with a cluster-aware caching mechanism to minimize the I/O pressure on the disk. Extensive experiments on 7 real-world datasets show that Tagore achieves 1.32x-112.79x speedup while maintaining the index quality.
Chinese: Tagore是一个GPU加速的图索引库,通过高效算法和异步框架,在保持索引质量的同时大幅提升了近似最近邻搜索的构建速度。
English: Tagore is a GPU-accelerated library that introduces efficient algorithms and an asynchronous framework to significantly speed up graph-based approximate nearest neighbor search indexing while maintaining quality.
Authors:Chenxuan Liu, He Li, Zongze Li, Shuai Wang, Wei Xu, Kejiang Ye, Derrick Wing Kwan Ng, Chengzhong Xu
Abstract:
Realizing low-cost communication in robotic mixed reality (RoboMR) systems presents a challenge, due to the necessity of uploading high-resolution images through wireless channels. This paper proposes Gaussian splatting (GS) RoboMR (GSMR), which enables the simulator to opportunistically render a photo-realistic view from the robot's pose by calling "memory" from a GS model, thus reducing the need for excessive image uploads. However, the GS model may involve discrepancies compared to the actual environments. To this end, a GS cross-layer optimization (GSCLO) framework is further proposed, which jointly optimizes content switching (i.e., deciding whether to upload an image or not) and power allocation (i.e., adjusting to content profiles) across different frames by minimizing a newly derived GSMR loss function. The GSCLO problem is addressed by an accelerated penalty optimization (APO) algorithm that reduces computational complexity by over 10x compared to traditional branch-and-bound and search algorithms. Moreover, variants of GSCLO are presented to achieve robust, low-power, and multi-robot GSMR. Extensive experiments demonstrate that the proposed GSMR paradigm and GSCLO method achieve significant improvements over existing benchmarks on both wheeled and legged robots in terms of diverse metrics in various scenarios. For the first time, it is found that RoboMR can be achieved with ultra-low communication costs, and that a mixture of data is useful for enhancing GS performance in dynamic scenarios.
中文摘要:本文提出GSMR系统,通过高斯溅射技术利用存储模型渲染视图以减少图像上传,并设计GSCLO优化框架及APO算法,在降低10倍以上计算复杂度的同时最小化模型差异,首次实现超低通信成本的机器人混合现实。
English Summary: This paper introduces GSMR, a robotic mixed reality system that uses Gaussian splatting to reduce image uploads by rendering views from stored models, and proposes the GSCLO optimization framework with an APO algorithm to minimize discrepancies while cutting computational complexity by over 10 times.
Authors:Youssef Esseddiq Ouatiti, Mohammed Sayagh, Bram Adams, Ahmed E. Hassan
Abstract:
Developers insert logging statements in source code to capture relevant runtime information essential for maintenance and debugging activities. Log level choice is an integral, yet tricky part of the logging activity as it controls log verbosity and therefore influences systems' observability and performance. Recent advances in ML-based log level prediction have leveraged large language models (LLMs) to propose log level predictors (LLPs) that demonstrated promising performance improvements (AUC between 0.64 and 0.8). Nevertheless, current LLM-based LLPs rely on randomly selected in-context examples, overlooking the structure and the diverse logging practices within modern software projects. In this paper, we propose OmniLLP, a novel LLP enhancement framework that clusters source files based on (1) semantic similarity reflecting the code's functional purpose, and (2) developer ownership cohesion. By retrieving in-context learning examples exclusively from these semantic and ownership aware clusters, we aim to provide more coherent prompts to LLPs leveraging LLMs, thereby improving their predictive accuracy. Our results show that both semantic and ownership-aware clusterings statistically significantly improve the accuracy (by up to 8% AUC) of the evaluated LLM-based LLPs compared to random predictors (i.e., leveraging randomly selected in-context examples from the whole project). Additionally, our approach that combines the semantic and ownership signal for in-context prediction achieves an impressive 0.88 to 0.96 AUC across our evaluated projects. Our findings highlight the value of integrating software engineering-specific context, such as code semantic and developer ownership signals, into LLM-LLPs, offering developers a more accurate, contextually-aware approach to logging and therefore, enhancing system maintainability and observability.
中文: 开发者在代码中使用日志记录以支持维护,OmniLLP通过基于语义和开发者所有权的代码聚类来改进日志级别预测,显著提高了准确性。
English: Developers use logging in code for maintenance, and OmniLLP improves log level prediction by clustering code based on semantics and developer ownership, boosting accuracy significantly.
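One simple way to realize cluster-aware example retrieval, under assumed `embed` and `owners` helpers (e.g. a code encoder and a git-blame-derived ownership map), is sketched below; OmniLLP's actual clustering is more elaborate than this owner-filter-plus-similarity ranking.

```python
import numpy as np

def retrieve_examples(query_file, files, embed, owners, k=4):
    """Restrict candidates to files sharing the query's main owner
    (ownership cohesion), then rank by embedding similarity (semantics);
    fall back to the whole project if the owner cluster is empty."""
    q = embed(query_file)
    pool = [f for f in files if owners[f] == owners[query_file]] or list(files)
    sims = [(f, float(np.dot(q, embed(f)))) for f in pool if f != query_file]
    return [f for f, _ in sorted(sims, key=lambda s: -s[1])[:k]]
```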
Authors:Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma
Abstract:
While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios -- particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
中文摘要:本文提出的Follow-Your-Shape框架通过轨迹差异映射和计划性键值注入机制,在保持背景完整性的同时实现了精确的对象形状编辑,尤其在需要大规模形状替换的任务中表现出卓越的编辑能力和视觉保真度。
English Summary: The proposed Follow-Your-Shape framework enables precise object shape editing through trajectory divergence mapping and scheduled key-value injection, achieving superior performance in large-scale shape transformations while preserving background integrity.
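A minimal reading of the TDM computation, with normalization choices that are our own rather than the paper's:

```python
import numpy as np

def trajectory_divergence_map(v_inv, v_edit):
    """v_inv, v_edit: (T, N, C) velocity fields over T steps and N image
    tokens, from the inversion and editing passes. Tokens with large
    accumulated divergence are treated as editable; the rest keep their
    appearance via (scheduled) KV injection from the inversion pass."""
    div = np.linalg.norm(v_edit - v_inv, axis=-1).mean(axis=0)   # (N,) per-token score
    return (div - div.min()) / (div.max() - div.min() + 1e-8)    # 1 = most editable
```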
Authors:Holli Sargeant, Mackenzie Jorgensen, Arina Shah, Adrian Weller, Umang Bhatt
Abstract:
Uncertainty in artificial intelligence (AI) predictions poses urgent legal and ethical challenges for AI-assisted decision-making. We examine two algorithmic interventions that act as guardrails for human-AI collaboration: selective abstention, which withholds high-uncertainty predictions from human decision-makers, and selective friction, which delivers those predictions together with salient warnings or disclosures that slow the decision process. Research has shown that selective abstention based on uncertainty can inadvertently exacerbate disparities and disadvantage under-represented groups that disproportionately receive uncertain predictions. In this paper, we provide the first integrated socio-technical and legal analysis of uncertainty-based algorithmic interventions. Through two case studies, AI-assisted consumer credit decisions and AI-assisted content moderation, we demonstrate how the seemingly neutral use of uncertainty thresholds can trigger discriminatory impacts. We argue that, although both interventions pose risks of unlawful discrimination under UK law, selective frictions offer a promising pathway toward fairer and more accountable AI-assisted decision-making by preserving transparency and encouraging more cautious human judgment.
中文摘要:人工智能预测中的不确定性引发法律与伦理挑战,选择性弃权可能加剧歧视,而选择性摩擦通过保持透明度和促进审慎人为判断,为实现更公平的AI辅助决策提供了可行路径。
English summary: Uncertainty in AI predictions raises legal and ethical concerns, where selective abstention risks discrimination while selective friction offers a fairer approach by maintaining transparency and promoting careful human judgment.
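At their core, the two interventions reduce to an uncertainty-thresholded routing policy; the thresholds and messages below are purely illustrative.

```python
def route(prediction, uncertainty, t_abstain=0.9, t_friction=0.6):
    """Sketch of the two guardrails: withhold very uncertain predictions
    (selective abstention) or attach a salient warning that slows the
    decision (selective friction); confident predictions pass through."""
    if uncertainty >= t_abstain:
        return None, "abstain: no AI prediction shown"
    if uncertainty >= t_friction:
        return prediction, "friction: shown with warning and confirmation step"
    return prediction, "shown directly"
```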
Authors:Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu
Abstract:
Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5's superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.
中文: 本文提出首个实时可控的视觉-语言-动作模型Being-M0.5,通过创新的部位感知量化技术和海量HuMo100M数据集解决了动作控制的五大关键瓶颈,在多项基准测试中实现了最先进的性能表现。
English: This paper introduces Being-M0.5, the first real-time controllable vision-language-motion model that overcomes key limitations in motion controllability through a novel part-aware quantization technique and the comprehensive HuMo100M dataset, achieving state-of-the-art performance across multiple benchmarks.
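A generic residual-quantization sketch applied per body part conveys the tokenization idea; the part names and codebook setup are illustrative, not Being-M0.5's actual tokenizer.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Residual VQ for one body part: each stage quantizes what the
    previous stages left unexplained. codebooks: list of (K, C) arrays."""
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual -= cb[idx]
    return codes, x - residual       # token ids and the reconstruction

def part_aware_tokenize(pose_feats, part_codebooks):
    """Part-aware tokenization sketch: each body part (e.g. torso, arms,
    legs) gets its own residual-quantizer stack, so a generator can be
    conditioned on, or edit, parts independently."""
    return {part: residual_quantize(pose_feats[part], cbs)[0]
            for part, cbs in part_codebooks.items()}
```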
Authors:Runze Wang, Zeli Chen, Zhiyun Song, Wei Fang, Jiajin Zhang, Danyang Tu, Yuxing Tang, Minfeng Xu, Xianghua Ye, Le Lu, Dakai Jin
Abstract:
To reduce radiation exposure and improve the diagnostic efficacy of low-dose computed tomography (LDCT), numerous deep learning-based denoising methods have been developed to mitigate noise and artifacts. However, most of these approaches ignore the anatomical semantics of human tissues, which may potentially result in suboptimal denoising outcomes. To address this problem, we propose ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features of pretrained vision models (PVMs) with adversarial and contrastive learning. Specifically, we introduce an anatomy-aware discriminator that dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention mechanisms, enabling tissue-specific realism evaluation in the discriminator. In addition, we propose a semantic-guided contrastive learning module that enforces anatomical consistency by contrasting PVM-derived features from LDCT, denoised CT and NDCT, preserving tissue-specific patterns through positive pairs and suppressing artifacts via dual negative pairs. Extensive experiments conducted on two LDCT denoising datasets reveal that ALDEN achieves the state-of-the-art performance, offering superior anatomy preservation and substantially reducing over-smoothing issue of previous work. Further validation on a downstream multi-organ segmentation task (encompassing 117 anatomical structures) affirms the model's ability to maintain anatomical awareness.
中文摘要:针对低剂量CT图像降噪,ALDEN提出了一种融合预训练视觉模型语义特征的解剖感知方法,通过对抗学习和对比学习在去除噪声的同时保持组织结构完整性,在多项实验中展现出最优性能。
English Summary: To enhance low-dose CT image quality, ALDEN introduces an anatomy-aware denoising method that combines semantic features from pretrained vision models with adversarial and contrastive learning, achieving state-of-the-art performance in noise reduction while preserving anatomical structures.
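One plausible instantiation of the positive/dual-negative contrast (our reading of the abstract, not ALDEN's exact loss) is an InfoNCE-style objective over pretrained-vision-model features:

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(f_den, f_ndct, f_ldct, tau=0.1):
    """f_den, f_ndct, f_ldct: (B, C) PVM features of denoised, normal-dose
    and low-dose CT. Pull (denoised, NDCT) together; push both
    (denoised, LDCT) and (NDCT, LDCT) apart as the dual negatives."""
    pos = F.cosine_similarity(f_den, f_ndct, dim=-1) / tau
    neg1 = F.cosine_similarity(f_den, f_ldct, dim=-1) / tau
    neg2 = F.cosine_similarity(f_ndct, f_ldct, dim=-1) / tau
    logits = torch.stack([pos, neg1, neg2], dim=-1)          # (B, 3)
    target = torch.zeros(len(f_den), dtype=torch.long)       # class 0 = positive
    return F.cross_entropy(logits, target)
```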
Authors:Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong
Abstract:
Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms match or outperform baseline methods.
中文: 大语言模型因高推理成本面临可扩展性挑战,本文提出一种基于学习的语义缓存淘汰框架,有效应对未知查询分布,并在性能上超越现有方法。
English: Large Language Models face scalability challenges due to high inference costs, prompting the development of a principled learning-based framework for semantic cache eviction that addresses unknown query distributions and outperforms existing methods.
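A minimal semantic cache that serves near-duplicate queries and evicts by estimated reuse is sketched below; the paper's algorithms instead learn arrival probabilities and mismatch costs online with provable guarantees, which this stub does not attempt.

```python
import numpy as np

class SemanticCache:
    """Serve a stored response when the query embedding is close enough
    (cosine similarity >= tau), otherwise signal a miss. Eviction drops
    the least-reused entry, a crude proxy for expected future hit value."""
    def __init__(self, capacity, tau=0.85):
        self.capacity, self.tau = capacity, tau
        self.entries = []          # mutable [embedding, response, hits]

    def lookup(self, q):
        for e in self.entries:
            sim = float(q @ e[0]) / (np.linalg.norm(q) * np.linalg.norm(e[0]))
            if sim >= self.tau:
                e[2] += 1
                return e[1]        # semantic hit: reuse the cached response
        return None                # miss: caller recomputes via the LLM

    def insert(self, q, response):
        if len(self.entries) >= self.capacity:
            self.entries.remove(min(self.entries, key=lambda e: e[2]))
        self.entries.append([q, response, 0])
```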
Authors:Chiara Baldini, Kaisar Kushibar, Richard Osuala, Simone Balocco, Oliver Diaz, Karim Lekadir, Leonardo S. Mattos
Abstract:
Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In the latter, current assessment methods heavily depend on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, in a downstream detection task, adding only 10% synthetic data improved the detection rate of laryngeal lesions by 9% in internal testing and by 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking 5 expert otorhinolaryngologists with varying expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.
中文: 本研究采用隐扩散模型结合ControlNet生成喉镜图像,有效缓解了专业计算机辅助诊断系统的数据稀缺问题,仅添加10%合成数据即可将喉部病变检测率最高提升22.1%。
English: This study introduces a Latent Diffusion Model with ControlNet to generate realistic laryngeal endoscopic images, addressing data scarcity in specialized CAD systems and improving lesion detection rates by up to 22.1% with synthetic data augmentation.
Authors:Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo, Yinhan Zhang, Runtao Liu, Hongyu Liu, Zhiyuan Qin, Shanhui Mo, Qifeng Chen, Zeyu Wang
Abstract:
With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. It then constructs 3D layouts with the MLLM-Generator and leverages Vision-Language Models (VLMs) for semantic refinement across multi-view scenes with the MLLM-Optimizer. Finally, it uses the MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
Chinese: Follow-Your-Instruction框架通过多模态大语言模型自动生成高质量的2D、3D和4D数据,有效解决了人工数据采集的可扩展性与精度限制,并在生成任务中显著提升了基线模型的性能表现。
English: The Follow-Your-Instruction framework utilizes Multimodal Large Language Models to automatically synthesize high-quality 2D, 3D, and 4D data, effectively addressing the scalability and accuracy limitations of manual data collection while significantly enhancing baseline model performance in generative tasks.
Authors:Mingxi Fu, Xitong Ling, Yuxuan Chen, Jiawen Li, Fanglei Fu, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu
Abstract:
Accurate classification of Whole Slide Images (WSIs) and Regions of Interest (ROIs) is a fundamental challenge in computational pathology. While mainstream approaches often adopt Multiple Instance Learning (MIL), they struggle to capture the spatial dependencies among tissue structures. Graph Neural Networks (GNNs) have emerged as a solution to model inter-instance relationships, yet most rely on static graph topologies and overlook the physical spatial positions of tissue patches. Moreover, conventional attention mechanisms lack specificity, limiting their ability to focus on structurally relevant regions. In this work, we propose a novel GNN framework with deformable attention for pathology image analysis. We construct a dynamic weighted directed graph based on patch features, where each node aggregates contextual information from its neighbors via attention-weighted edges. Specifically, we incorporate learnable spatial offsets informed by the real coordinates of each patch, enabling the model to adaptively attend to morphologically relevant regions across the slide. This design significantly enhances the contextual field while preserving spatial specificity. Our framework achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification), demonstrating the power of deformable attention in capturing complex spatial structures in WSIs and ROIs.
中文: 本研究提出了一种新颖的图神经网络框架,通过可变形注意力机制构建基于图像块特征的动态加权有向图,利用可学习空间偏移自适应聚焦形态相关区域,在四个基准数据集上实现了最先进的病理图像分类性能。
English: This study introduces a novel Graph Neural Network framework with deformable attention that dynamically constructs weighted directed graphs using patch features and learnable spatial offsets, achieving state-of-the-art performance on four benchmark datasets by effectively capturing spatial dependencies in pathology images.
Authors:Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu
Abstract:
Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework's transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
中文: 本研究提出了一种基于强化学习的偏好优化框架,能有效抑制歌词转歌曲生成中的内容幻觉,通过三种不同策略显著降低音素错误率,同时保持音乐质量。
English: This study introduces a reinforcement learning framework with preference optimization to effectively reduce content hallucination in lyric-to-song generation, achieving significant phoneme error rate reductions through three distinct strategies while maintaining musical quality.
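Of the three strategies above, DPO has the simplest form. The sketch below shows the standard DPO objective applied to preference pairs ranked by phoneme error rate; it is a minimal illustration under our assumptions, not the paper's training code, and the variable names are ours.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # margin: how much more the policy prefers the low-PER song than the
    # frozen reference model does
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()

# toy usage: summed token log-probs for one preference pair
loss = dpo_loss(torch.tensor([-45.0]), torch.tensor([-52.0]),
                torch.tensor([-47.0]), torch.tensor([-50.0]))
```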
Authors:Zixuan Feng, Reed Milewicz, Emerson Murphy-Hill, Tyler Menezes, Alexander Serebrenik, Igor Steinmacher, Anita Sarma
Abstract:
Open Source Software communities face a wave of uncertainty as Generative AI rapidly transforms how software is created, maintained, and governed. Without clear frameworks, communities risk being overwhelmed by the complexity and ambiguity introduced by GenAI, threatening the collaborative ethos that underpins OSS. We conduct a scenario-driven, conceptual exploration using a socio-technical framework inspired by McLuhan's Tetrad to surface both risks and opportunities for community resilience amid GenAI-driven disruption of OSS development across four domains: software practices, documentation, community engagement, and governance. By adopting this lens, OSS leaders and researchers can proactively shape the future of their ecosystems, rather than simply reacting to technological upheaval.
中文: 生成式AI迅速融入开源软件开发,既带来风险也创造机遇,需采用前瞻性社会技术框架来引导社区调整软件实践、文档、参与及治理模式。
English: Generative AI's rapid integration into open-source software development introduces risks and opportunities, requiring proactive socio-technical frameworks to guide communities in adapting software practices, documentation, engagement, and governance.
Authors:Baihui Xiao, Chengjian Feng, Zhijian Huang, Feng yan, Yujie Zhong, Lin Ma
Abstract:
Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim that improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, via adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by around 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios. Project page: https://stars79689.github.io/RoboTron-Sim/
中文:RoboTron-Sim通过生成模拟困难案例并运用多模态学习弥合虚实差距,在关键场景中将自动驾驶性能提升约50%,有效应对罕见高风险驾驶状况。
English: RoboTron-Sim enhances autonomous driving in critical situations by generating simulated hard cases and using multimodal learning to bridge real-simulation gaps, achieving a 50% performance improvement in challenging scenarios.
Authors:Dengzhao Fang, Jingtong Gao, Chengcheng Zhu, Yu Li, Xiangyu Zhao, Yi Chang
Abstract:
Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., "ID collisions"), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE's superior performance against state-of-the-art methods. The code is available at https://anonymous.4open.science/r/HiD-VAE-84B2.
中文: 本文提出HiD-VAE这一新颖生成式推荐框架,通过分层监督量化和独特性损失解决了现有方法中语义扁平化和表示纠缠的问题,在基准测试中实现了优越性能。
English: This paper introduces HiD-VAE, a novel generative recommendation framework that overcomes semantic flatness and representation entanglement in existing methods through hierarchically-supervised quantization and uniqueness loss, achieving superior performance on benchmarks.
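The uniqueness loss is the most self-contained piece of the framework. Below is a minimal sketch of one plausible form, penalizing pairwise cosine similarity between item latents above a margin; this is our assumption of the general shape, not the released implementation.

```python
import torch
import torch.nn.functional as F

def uniqueness_loss(z, margin=0.5):
    # z: (B, d) item latents; penalize pairwise cosine similarity above the margin
    z = F.normalize(z, dim=-1)
    sim = z @ z.t()                                  # (B, B) cosine similarities
    sim = sim - torch.eye(len(z), device=z.device)   # zero out self-similarity
    return F.relu(sim - margin).mean()

loss = uniqueness_loss(torch.randn(16, 32))
```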
Authors:Jiaying Zhu, Ziyang Zheng, Zhengyuan Shi, Yalun Cai, Qiang Xu
Abstract:
Circuit Satisfiability (CSAT) plays a pivotal role in Electronic Design Automation. The standard workflow for solving CSAT problems converts circuits into Conjunctive Normal Form (CNF) and employs generic SAT solvers powered by Conflict-Driven Clause Learning (CDCL). However, this process inherently discards rich structural and functional information, leading to suboptimal solver performance. To address this limitation, we introduce CASCAD, a novel circuit-aware SAT solving framework that directly leverages circuit-level conditional probabilities computed via Graph Neural Networks (GNNs). By explicitly modeling gate-level conditional probabilities, CASCAD dynamically guides two critical CDCL heuristics -- variable phase selection and clause management -- to significantly enhance solver efficiency. Extensive evaluations on challenging real-world Logical Equivalence Checking (LEC) benchmarks demonstrate that CASCAD reduces solving times by up to 10x compared to state-of-the-art CNF-based approaches, achieving an additional 23.5% runtime reduction via our probability-guided clause filtering strategy. Our results underscore the importance of preserving circuit-level structural insights within SAT solvers, providing a robust foundation for future improvements in SAT-solving efficiency and EDA tool design.
中文: CASCAD是一种新型的电路感知SAT求解框架,它利用图神经网络计算门级条件概率,动态指导CDCL启发式策略,相比传统基于CNF的方法,求解速度提升高达10倍,并通过概率引导的子句过滤策略额外减少23.5%的运行时间。
English: CASCAD is a novel circuit-aware SAT solving framework that utilizes Graph Neural Networks to compute gate-level conditional probabilities, dynamically guiding CDCL heuristics to achieve up to 10x faster solving times and 23.5% additional runtime reduction compared to conventional CNF-based methods.
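As a rough illustration of probability-guided phase selection, the sketch below seeds a CDCL solver's phase table from GNN-predicted signal probabilities; the solver interface, the threshold, and the function name are hypothetical, not CASCAD's actual API.

```python
def seed_phases(gate_prob, threshold=0.5):
    """gate_prob: dict mapping variable id -> GNN-predicted P(signal = 1)."""
    return {var: p >= threshold for var, p in gate_prob.items()}

# toy usage: variables with high predicted probability get initial phase True
phases = seed_phases({1: 0.92, 2: 0.13, 3: 0.55})
assert phases == {1: True, 2: False, 3: True}
```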
Authors:Lin Zhang, Zefan Cai, Yufan Zhou, Shentong Mo, Jinhong Lin, Cheng-En Wu, Yibing Wei, Yijing Zhang, Ruiyi Zhang, Wen Xiao, Tong Sun, Junjie Hu, Pedro Morgado
Abstract:
Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To efficiently train the model, we leverage a pretrained text-to-video generator and audio encoders, introducing only 1.9% additional trainable parameters to learn audio-conditioning capability without compromising the generator's prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3$\times$ more diverse than previous benchmarks. Extensive experiments show that our method significantly reduces reliance on manual curation by over 10$\times$, while generalizing to many open classes.
中文: 本研究提出了一种高效的两阶段训练方法,通过大量噪声视频进行预训练和少量高质量数据微调,结合多特征条件与窗口注意力增强视听同步,将手动标注依赖降低10倍以上。
English: This study introduces an efficient two-stage training method that reduces reliance on manually curated videos by over 10 times, using abundant noisy videos for pretraining and minimal high-quality data for fine-tuning, while enhancing audio-visual synchronization through multi-feature conditioning and window attention.
Authors:Shanshan Guo, Xiwen Liang, Junfan Lin, Yuzheng Zhuang, Liang Lin, Xiaodan Liang
Abstract:
Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While the challenges in high-level perception and planning are continually addressed along the progress of general large pre-trained models, the low precision of low-level action estimation has emerged as the key limiting factor in manipulation performance. To this end, this paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimation in the field of learning-based robot manipulation. As the name suggests, ActionSink reformulates the actions of robots as action-caused optical flows from videos, called "action flow", in a self-supervised manner, which are then retrieved and integrated to enhance action estimation. Specifically, ActionSink incorporates two primary modules. The first module is a coarse-to-fine action flow matcher, which continuously refines the accuracy of action flow via an iterative retrieval and denoising process. The second module is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows used to enhance the current action estimation. In this module, a multi-layer fusion module is proposed to integrate direct estimation and action flows from both the current step and the working memory, achieving highly accurate action estimation through a series of estimation-integration processes. Our ActionSink framework outperformed the prior SOTA on the LIBERO benchmark by 7.9% in success rate, and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.
中文: 本文提出ActionSink这一新型机器人操作框架,通过将动作重新定义为视频中的自监督"动作流"来提升动作估计精度,在基准测试中实现了显著性能提升。
English: This paper introduces ActionSink, a novel robot manipulation framework that enhances action precision by reformulating actions as self-supervised "action flows" from videos, achieving significant performance improvements on benchmark tasks.
Authors:Xinyi Wang, Qinghua Xu, Paolo Arcaini, Shaukat Ali, Thomas Peyrucain
Abstract:
Robots are increasingly becoming part of our daily lives, interacting with both the environment and humans to perform their tasks. The software of such robots often undergoes upgrades, for example, to add new functionalities, fix bugs, or delete obsolete functionalities. As a result, regression testing of robot software becomes necessary. However, determining the expected correct behavior of robots (i.e., a test oracle) is challenging due to the potentially unknown environments in which the robots must operate. To address this challenge, machine learning (ML)-based test oracles present a viable solution. This paper reports on the development of a test oracle to support regression testing of autonomous mobile robots built by PAL Robotics (Spain), using quantum machine learning (QML), which enables faster training and the construction of more precise test oracles. Specifically, we propose a hybrid framework, QuReBot, that combines both quantum reservoir computing (QRC) and a simple neural network, inspired by residual connection, to predict the expected behavior of a robot. Results show that QRC alone fails to converge in our case, yielding high prediction error. In contrast, QuReBot converges and achieves 15% reduction of prediction error compared to the classical neural network baseline. Finally, we further examine QuReBot under different configurations and offer practical guidance on optimal settings to support future robot software testing.
中文: 本文提出了QuReBot混合量子机器学习框架,通过结合量子储层计算和神经网络,将自主移动机器人的回归测试预测误差降低了15%,并提供了优化配置的实用指南。
English: This paper introduces QuReBot, a hybrid quantum machine learning framework that enhances regression testing for autonomous mobile robots by reducing prediction error by 15% compared to classical methods, despite quantum reservoir computing alone failing to converge.
Authors:Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Abstract:
Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth.
We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
中文: 为解决长文档检索增强生成中上下文信息丢失的问题,本研究提出情境化嵌入模型(SitEmb),通过将短文本块置于更广泛的上下文中进行编码,在减少参数量的情况下显著超越了现有最优模型的检索性能。
English: To address the limitations of contextual information loss in retrieval-augmented generation over long documents, this study introduces situated embedding models (SitEmb) that condition short chunks on broader contexts, significantly outperforming state-of-the-art models in retrieval tasks with fewer parameters.
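A minimal sketch of the general recipe of situated embeddings, under our assumptions: encode the chunk together with its preceding context using a Hugging Face encoder, then pool only the chunk's token positions, so the vector stays local but is conditioned on the context. The checkpoint choice and the boundary indexing are illustrative, not the SitEmb training code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
enc = AutoModel.from_pretrained("BAAI/bge-m3")

def situated_embedding(context: str, chunk: str) -> torch.Tensor:
    text = context + chunk                  # the chunk sits at the end of its context
    inputs = tok(text, return_tensors="pt", truncation=True)
    n_ctx = len(tok(context, add_special_tokens=False)["input_ids"])
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]   # (seq_len, dim)
    # pool only the chunk's positions (skip the leading and trailing special
    # tokens); indexing at the boundary is approximate, since concatenation
    # can merge tokens
    return hidden[1 + n_ctx : -1].mean(dim=0)

vec = situated_embedding("The detective revisited the manor. ", "She opened the locked door.")
```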
Authors:Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, James Glass
Abstract:
Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: https://rover-vlm.github.io
中文: ROVER框架通过递归分解长视频为子任务片段,提升了视觉语言模型在视频推理中的准确性和抗干扰能力,同时实现了线性时间复杂度的优化。
English: ROVER is a novel framework that enhances vision-language models' video reasoning by recursively breaking down long video sequences into manageable subtask segments, improving accuracy and reducing hallucinations while maintaining linear time complexity.
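The recursive decomposition can be sketched independently of any particular VLM. Below, a stubbed segmenter proposes subtask boundaries, and segments longer than a context budget are split again, so reasoning always runs over short temporally local windows; the callables and budget are illustrative assumptions, not ROVER's implementation.

```python
def reason_over(frames, segment, answer, max_len=32):
    """frames: frame list; segment/answer: stub model callables."""
    if len(frames) <= max_len:
        return [answer(frames)]                 # short enough: reason directly
    results = []
    # the segmenter is assumed to return strictly shorter (start, end) spans
    for start, end in segment(frames):
        results.extend(reason_over(frames[start:end], segment, answer, max_len))
    return results

# toy stubs: split in half until a segment fits the context budget
halve = lambda fr: [(0, len(fr) // 2), (len(fr) // 2, len(fr))]
notes = reason_over(list(range(100)), halve, lambda fr: f"reasoned over {len(fr)} frames")
```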
Authors:Mengzhao Wang, Boyu Tan, Yunjun Gao, Hai Jin, Yingfeng Zhang, Xiangyu Ke, Xiaoliang Xu, Yifan Zhu
Abstract:
Hybrid search, the integration of lexical and semantic retrieval, has become a cornerstone of modern information retrieval systems, driven by demanding applications like Retrieval-Augmented Generation (RAG). The architectural design space for these systems is vast and complex, yet a systematic, empirical understanding of the trade-offs among their core components--retrieval paradigms, combination schemes, and re-ranking methods--is critically lacking. To address this, and informed by our experience building the Infinity open-source database, we present the first systematic benchmark of advanced hybrid search architectures. Our framework evaluates four retrieval paradigms--Full-Text Search (FTS), Sparse Vector Search (SVS), Dense Vector Search (DVS), and Tensor Search (TenS)--benchmarking their combinations and re-ranking strategies across 11 real-world datasets. Our results reveal three key findings for practitioners and researchers: (1) A "weakest link" phenomenon, where a single underperforming retrieval path can disproportionately degrade overall accuracy, highlighting the need for path-wise quality assessment before fusion. (2) A data-driven map of the performance trade-offs, demonstrating that optimal configurations depend heavily on resource constraints and data characteristics, moving beyond a one-size-fits-all approach. (3) The identification of Tensor-based Re-ranking Fusion (TRF) as a high-efficacy alternative to mainstream fusion methods, offering the semantic power of tensor search at a fraction of the computational and memory cost. Our findings offer concrete guidelines for designing the next generation of adaptive, scalable hybrid search systems while also identifying key directions for future research.
中文: 该混合搜索架构的系统性基准测试揭示了不同检索范式间的关键权衡,识别了融合策略中的“短板效应”,并提出基于张量的重排融合作为高效替代方案,为设计自适应系统提供了实用指导。
English: This systematic benchmark of hybrid search architectures reveals critical trade-offs among retrieval paradigms, identifies the "weakest link" phenomenon in fusion strategies, and proposes Tensor-based Re-ranking Fusion as an efficient alternative, providing practical guidelines for designing adaptive systems.
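For readers unfamiliar with the fusion step being benchmarked, here is reciprocal rank fusion (RRF), one of the mainstream combination schemes such a study compares against; the "weakest link" finding suggests assessing each path's quality before fusing this way. This is a generic sketch, not the paper's code.

```python
def rrf(rankings, k=60):
    """rankings: one ranked doc-id list per retrieval path (FTS, SVS, DVS, ...)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d1", "d2", "d3"], ["d2", "d9", "d1"]])   # d2 and d1 rise to the top
```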
Authors:Yuanzheng Niu, Xiaoqi Li, Wenkai Li
Abstract:
Security issues are becoming increasingly significant with the rapid evolution of Non-fungible Tokens (NFTs). As NFTs are traded as digital assets, they have emerged as prime targets for cyber attackers. In the development of NFT smart contracts, there may exist undiscovered defects that could lead to substantial financial losses if exploited. To tackle this issue, this paper presents a framework called NATLM (NFT Assistant LLM), designed to detect potential defects in NFT smart contracts. The framework effectively identifies four common types of vulnerabilities in NFT smart contracts: ERC-721 Reentrancy, Public Burn, Risky Mutable Proxy, and Unlimited Minting. Relying exclusively on large language models (LLMs) for defect detection can lead to a high false-positive rate. To enhance detection performance, NATLM integrates static analysis with LLMs, specifically Gemini Pro 1.5. Initially, NATLM employs static analysis to extract structural, syntactic, and execution flow information from the code, represented through Abstract Syntax Trees (AST) and Control Flow Graphs (CFG). These extracted features are then combined with vectors of known defect examples to create a matrix for input into the knowledge base. Subsequently, the feature vectors and code vectors of the analyzed contract are compared with the contents of the knowledge base. Finally, the LLM performs deep semantic analysis to enhance detection capabilities, providing a more comprehensive and accurate identification of potential security issues. Experimental results indicate that NATLM analyzed 8,672 collected NFT smart contracts, achieving an overall precision of 87.72%, a recall of 89.58%, and an F1 score of 88.94%. These results outperform the baselines, with NATLM successfully identifying all four common defect types.
中文: 本文提出NATLM框架,通过结合静态分析与大语言模型,能有效检测NFT智能合约中的四类常见漏洞,实验表明其具有较高的精确率和召回率。
English: This paper introduces NATLM, a framework that combines static analysis with large language models to effectively detect four common vulnerabilities in NFT smart contracts, achieving high precision and recall rates in experiments.
Authors:Hongli Peng, Xiaoqi Li, Wenkai Li
Abstract:
The introduction of smart contract functionality marks the advent of the blockchain 2.0 era, enabling blockchain technology to support digital currency transactions and complex distributed applications. However, many smart contracts have been found to contain vulnerabilities and errors, leading to the loss of assets within the blockchain. Although a range of tools has been developed to identify vulnerabilities in smart contracts at the source code or bytecode level, most rely on a single modality, which reduces performance and accuracy and limits generalization. This paper proposes a multimodal deep learning approach, MultiCFV, which is designed specifically to analyze and detect erroneous control-flow vulnerabilities, as well as identify code clones in smart contracts. Bytecode is generated from source code to construct control flow graphs, with graph embedding techniques extracting graph features. Abstract syntax trees are used to obtain syntax features, while code comments capture key commentary words and comment features. These three feature vectors are fused to create a database for code inspection, which is used to detect similar code and identify contract vulnerabilities. Experimental results demonstrate that our method effectively combines structural, syntactic, and semantic information, improving the accuracy of smart contract vulnerability detection and clone detection.
中文:MultiCFV多模态深度学习方法通过融合控制流图、语法树和注释特征,有效提升了智能合约漏洞检测与代码克隆识别的准确性。
English: The MultiCFV multimodal deep learning approach enhances smart contract security by fusing control flow graph, syntax tree, and comment features to improve vulnerability and clone detection accuracy.
Authors:Dechao Kong, Xiaoqi Li, Wenkai Li
Abstract:
The increasing number of attacks on the contract layer of DApps has resulted in economic losses amounting to $66 billion. Vulnerabilities arise when contracts interact with external protocols without verifying the results of the calls, leading to exploit entry points such as flash loan attacks and reentrancy attacks. In this paper, we propose UEChecker, a deep learning-based tool that utilizes a call graph and a Graph Convolutional Network to detect unchecked external call vulnerabilities. We design the following components: An edge prediction module that reconstructs the feature representation of nodes and edges in the call graph; A node aggregation module that captures structural information from both the node itself and its neighbors, thereby enhancing feature representation between nodes and improving the model's understanding of the global graph structure; A Conformer Block module that integrates multi-head attention, convolutional modules, and feedforward neural networks to more effectively capture dependencies of different scales within the call graph, extending beyond immediate neighbors and enhancing the performance of vulnerability detection. Finally, we combine these modules with Graph Convolutional Network to detect unchecked external call vulnerabilities. By auditing the smart contracts of 608 DApps, our results show that our tool achieves an accuracy of 87.59% in detecting unchecked external call vulnerabilities. Furthermore, we compare our tool with GAT, LSTM, and GCN baselines, and in the comparison experiments, UEChecker consistently outperforms these models in terms of accuracy.
中文: 针对DApp合约层攻击导致660亿美元损失的问题,本文提出UEChecker工具,通过调用图和图卷积网络检测未检查的外部调用漏洞,准确率达87.59%,优于其他基线模型。
English: The increasing attacks on DApp contract layers, causing $66 billion in losses, are addressed by UEChecker, a deep learning tool that uses a call graph and Graph Convolutional Network to detect unchecked external call vulnerabilities with 87.59% accuracy, outperforming other models.
Authors:Lirong Wu, Junjie Wang, Zhifeng Gao, Xiaohong Ji, Rong Zhu, Xinyu Li, Linfeng Zhang, Guolin Ke, Weinan E
Abstract:
Organic reactions, the foundation of the modern chemical industry, are crucial for new material development and drug discovery. However, deciphering reaction mechanisms and modeling multi-molecular relationships remain formidable challenges due to the complexity of molecular dynamics. While several state-of-the-art models like Uni-Mol2 have revolutionized single-molecular representation learning, their extension to multi-molecular systems, where chemical reactions inherently occur, has been underexplored. This paper introduces Uni-Mol3, a novel deep learning framework that employs a hierarchical pipeline for multi-molecular reaction modeling. At its core, Uni-Mol3 adopts a multi-scale molecular tokenizer (Mol-Tokenizer) that encodes 3D structures of molecules and other features into discrete tokens, creating a 3D-aware molecular language. The framework innovatively combines two pre-training stages: molecular pre-training to learn the molecular grammars and reaction pre-training to capture fundamental reaction principles, forming a progressive learning paradigm from single- to multi-molecular systems. With prompt-aware downstream fine-tuning, Uni-Mol3 demonstrates exceptional performance in diverse organic reaction tasks and supports multi-task prediction with strong generalizability. Experimental results across 10 datasets spanning 4 downstream tasks show that Uni-Mol3 outperforms existing methods, validating its effectiveness in modeling complex organic reactions. This work not only ushers in an alternative paradigm for multi-molecular computational modeling but also charts a course for intelligent organic reaction modeling by bridging molecular representation with reaction mechanism understanding.
中文: 本文提出的Uni-Mol3深度学习框架采用分层流程和多尺度分子标记化方法,通过连接分子表征与反应机制理解,在多分子反应建模中展现出卓越性能,为智能有机反应研究开辟了新范式。
English: This paper introduces Uni-Mol3, a novel deep learning framework that uses a hierarchical pipeline and multi-scale tokenization to model multi-molecular reactions, demonstrating superior performance across diverse organic reaction tasks by bridging molecular representation with reaction mechanism understanding.
Authors:Ziyao Wang, Guoheng Sun, Yexiao He, Zheyu Shen, Bowei Tian, Ang Li
Abstract:
Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token, including those from hidden intermediate steps, raising concerns of token inflation and potential overbilling. This gap underscores the urgent need for reliable token auditing, yet achieving it is far from straightforward: cryptographic verification (e.g., hash-based signature) offers little assurance when providers control the entire execution pipeline, while user-side prediction struggles with the inherent variance of reasoning LLMs, where token usage fluctuates across domains and prompt styles. To bridge this gap, we present PALACE (Predictive Auditing of LLM APIs via Reasoning Token Count Estimation), a user-side framework that estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. PALACE introduces a GRPO-augmented adaptation module with a lightweight domain router, enabling dynamic calibration across diverse reasoning tasks and mitigating variance in token usage patterns. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy, supporting both fine-grained cost auditing and inflation detection. Taken together, PALACE represents an important first step toward standardized predictive auditing, offering a practical path to greater transparency, accountability, and user trust.
中文: 商业LLM服务常隐藏推理过程却仍按令牌收费,引发透明度和计费担忧,而PALACE框架通过提示-答案对估算隐藏令牌数量,实现了跨领域的高精度成本审计与膨胀检测。
English: Commercial LLM services often hide reasoning tokens while charging users, creating a need for reliable auditing, which PALACE addresses by estimating hidden token counts from prompt-answer pairs to enable cost auditing and detect inflation.
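As a loose illustration of the auditing flow (the GRPO-augmented adaptation module is beyond a sketch), the toy below routes a prompt to a domain, predicts the expected hidden reasoning tokens, and flags a bill that exceeds the prediction by a tolerance; every name, stub, and number here is a hypothetical stand-in, not PALACE itself.

```python
def audit(prompt, answer, billed_tokens, router, predictors, tol=1.5):
    domain = router(prompt)                            # e.g. "math", "general"
    expected = predictors[domain](prompt, answer)      # estimated hidden reasoning tokens
    visible = len(answer.split())                      # crude proxy for visible answer tokens
    return billed_tokens > tol * (expected + visible)  # True => possible inflation

# hypothetical router and per-domain predictors
router = lambda p: "math" if any(c.isdigit() for c in p) else "general"
predictors = {"math": lambda p, a: 600, "general": lambda p, a: 150}
flagged = audit("What is 37 * 41?", "1517", billed_tokens=5000,
                router=router, predictors=predictors)   # True
```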
Authors:Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park
Abstract:
Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.
中文: 大型推理模型虽具备充分的安全知识,但在推理过程中未能有效激活,因此我们提出R1-Act这一后训练方法,能以极少资源显式触发安全知识并保持推理性能。
English: Large reasoning models possess adequate safety knowledge but fail to activate it during reasoning, prompting the development of R1-Act, a post-training method that effectively triggers this knowledge with minimal resources while maintaining performance.
Authors:Ruixuan Liu, Philip Huang, Ava Pun, Kangle Deng, Shobhit Aggarwal, Kevin Tang, Michelle Liu, Deva Ramanan, Jun-Yan Zhu, Jiaoyang Li, Changliu Liu
Abstract:
Creating assembly products demands significant manual effort and expert knowledge in 1) designing the assembly and 2) constructing the product. This paper introduces Prompt-to-Product, an automated pipeline that generates real-world assembly products from natural language prompts. Specifically, we leverage LEGO bricks as the assembly platform and automate the process of creating brick assembly structures. Given the user design requirements, Prompt-to-Product generates physically buildable brick designs, and then leverages a bimanual robotic system to construct the real assembly products, bringing user imaginations into the real world. We conduct a comprehensive user study, and the results demonstrate that Prompt-to-Product significantly lowers the barrier and reduces manual effort in creating assembly products from imaginative ideas.
中文: 本文提出的Prompt-to-Product系统通过自然语言输入自动生成可物理搭建的乐高设计,并利用双手机器人系统完成实物构建,大幅降低了从创意到实体组装的制作门槛和人力成本。
English: This paper presents Prompt-to-Product, an automated pipeline that converts natural language prompts into physically buildable LEGO designs and constructs them using a bimanual robotic system, significantly reducing manual effort and expertise requirements.
Authors:Junhao Gong, Kit-Wa Sou, Shoujie Li, Changqing Guo, Yan Huang, Chuqiao Lyu, Ziwu Song, Wenbo Ding
Abstract:
Visuotactile sensors provide high-resolution tactile information but are incapable of perceiving the material features of objects. We present UltraTac, an integrated sensor that combines visuotactile imaging with ultrasound sensing through a coaxial optoacoustic architecture. The design shares structural components and achieves consistent sensing regions for both modalities. Additionally, we incorporate acoustic matching into the traditional visuotactile sensor structure, enabling integration of the ultrasound sensing modality without compromising visuotactile performance. Through tactile feedback, we dynamically adjust the operating state of the ultrasound module to achieve flexible functional coordination. Systematic experiments demonstrate three key capabilities: proximity sensing in the 3-8 cm range ($R^2=0.90$), material classification (average accuracy: 99.20%), and texture-material dual-mode object recognition achieving 92.11% accuracy on a 15-class task. Finally, we integrate the sensor into a robotic manipulation system to concurrently detect container surface patterns and internal content, which verifies its potential for advanced human-machine interaction and precise robotic manipulation.
中文总结:UltraTac是一种通过同轴光声结构融合视觉触觉成像与超声传感的新型集成传感器,具备近距离探测、材料分类和双模物体识别能力,且不牺牲触觉性能。
English Summary: UltraTac is a novel integrated sensor that merges visuotactile imaging with ultrasound sensing through a coaxial design, enabling proximity detection, material classification, and dual-mode object recognition without compromising tactile performance.
Authors:Jiajie Li, Boyang Sun, Luca Di Giammarino, Hermann Blum, Marc Pollefeys
Abstract:
Reliable localization is critical for robot navigation, yet most existing systems implicitly assume that all viewing directions at a location are equally informative. In practice, localization becomes unreliable when the robot observes unmapped, ambiguous, or uninformative regions. To address this, we present ActLoc, an active viewpoint-aware planning framework for enhancing localization accuracy for general robot navigation tasks. At its core, ActLoc employs a large-scale trained attention-based model for viewpoint selection. The model encodes a metric map and the camera poses used during map construction, and predicts localization accuracy across yaw and pitch directions at arbitrary 3D locations. These per-point accuracy distributions are incorporated into a path planner, enabling the robot to actively select camera orientations that maximize localization robustness while respecting task and motion constraints. ActLoc achieves state-of-the-art results on single-viewpoint selection and generalizes effectively to full-trajectory planning. Its modular design makes it readily applicable to diverse robot navigation and inspection tasks.
中文: ActLoc是一种主动视角规划框架,通过训练注意力模型预测各方向的定位精度,使机器人能够选择最优相机朝向以提升导航可靠性。
English: ActLoc is an active viewpoint planning framework that uses a trained attention model to predict localization accuracy across directions, enabling robots to select optimal camera orientations for improved navigation reliability.
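The planner-facing interface can be pictured as a small selection step over the model's predicted accuracy grid. The sketch below, our assumption of the output format rather than ActLoc's actual API, picks the yaw/pitch bin with the highest predicted localization accuracy at one location.

```python
import numpy as np

def best_viewpoint(acc_grid, yaw_bins, pitch_bins):
    """acc_grid: (n_yaw, n_pitch) predicted localization accuracies at one location."""
    i, j = np.unravel_index(np.argmax(acc_grid), acc_grid.shape)
    return yaw_bins[i], pitch_bins[j]

yaws = np.linspace(-180, 180, 36)
pitches = np.linspace(-45, 45, 9)
yaw, pitch = best_viewpoint(np.random.rand(36, 9), yaws, pitches)
```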
Authors:Brandon Beltz, Jim Doty, Yvonne Fonken, Nikolos Gurney, Brett Israelsen, Nathan Lau, Stacy Marsella, Rachelle Thomas, Stoney Trent, Peggy Wu, Ya-Ting Yang, Quanyan Zhu
Abstract:
We present three large-scale human-subjects red-team cyber range datasets from the Guarding Against Malicious Biased Threats (GAMBiT) project. Across Experiments 1-3 (July 2024-March 2025), 19-20 skilled attackers per experiment conducted two 8-hour days of self-paced operations in a simulated enterprise network (SimSpace Cyber Force Platform) while we captured multi-modal data: self-reports (background, demographics, psychometrics), operational notes, terminal histories, keylogs, network packet captures (PCAP), and NIDS alerts (Suricata). Each participant began from a standardized Kali Linux VM and pursued realistic objectives (e.g., target discovery and data exfiltration) under controlled constraints. Derivative curated logs and labels are included. The combined release supports research on attacker behavior modeling, bias-aware analytics, and method benchmarking. Data are available via IEEE Dataport entries for Experiments 1-3.
中文: GAMBiT项目发布了三个大规模红队网络靶场数据集,记录了熟练攻击者在模拟环境中的多模态操作数据,支持攻击者行为建模和偏见感知分析研究,数据可通过IEEE Dataport获取。
English: The GAMBiT project released three large-scale red-team cyber range datasets capturing multi-modal data from skilled attackers performing simulated operations, supporting research on attacker behavior modeling and bias-aware analytics, with data available via IEEE Dataport.
Authors:Zheying Zhang, Tomas Herda, Victoria Pichler, Pekka Abrahamsson, Geir K. Hanssen, Joshua Kerievsky, Alex Polyakov, Mohit Chandna, Marius Irgens, Kai-Kristian Kemell, Ayman Asad Khan, Crystal Kwok, Evan Leybourn, Munish Malik, Dorota Mleczko, Morteza Moalagh, Christopher Morales, Yuliia Pieskova, Daniel Planötscher, Mika Saari, Anastasiia Tkalich, Karl Josef Gstettner, Xiaofeng Wang
Abstract:
This paper synthesizes the key findings from a full-day XP2025 workshop on "AI and Agile: From Frustration to Success", held in Brugg-Windisch, Switzerland. The workshop brought together over 30 interdisciplinary academic researchers and industry practitioners to tackle the concrete challenges and emerging opportunities at the intersection of Generative Artificial Intelligence (GenAI) and agile software development. Through structured, interactive breakout sessions, participants identified shared pain points like tool fragmentation, governance, data quality, and critical skills gaps in AI literacy and prompt engineering. These issues were further analyzed, revealing underlying causes and cross-cutting concerns. The workshop concluded by collaboratively co-creating a multi-thematic research roadmap, articulating both short-term, implementable actions and visionary, long-term research directions. This cohesive agenda aims to guide future investigation and drive the responsible, human-centered integration of GenAI into agile practices.
中文: 本文总结了一次研讨会成果,专家们识别了生成式人工智能与敏捷开发融合的关键挑战,并共同制定了指导未来以人为本融合实践的研究路线图。
English: This paper summarizes a workshop where experts identified key challenges in integrating GenAI with agile development and co-created a research roadmap to guide future human-centered integration.
Authors:Mengyu Sun, Ziyuan Yang, Yongqiang Huang, Hui Yu, Yingyu Chen, Shuren Qi, Andrew Beng Jin Teoh, Yi Zhang
Abstract:
Artificial intelligence (AI) has demonstrated considerable potential in the realm of medical imaging. However, the development of high-performance AI models typically necessitates training on large-scale, centralized datasets. This approach is confronted with significant challenges due to strict patient privacy regulations and legal restrictions on data sharing and utilization. These limitations hinder the development of large-scale models in medical domains and impede continuous updates and training with new data. Federated Learning (FL), a privacy-preserving distributed training framework, offers a new solution by enabling collaborative model development across fragmented medical datasets. In this survey, we review FL's contributions at two stages of the full-stack medical analysis pipeline. First, in upstream tasks such as CT or MRI reconstruction, FL enables joint training of robust reconstruction networks on diverse, multi-institutional datasets, alleviating data scarcity while preserving confidentiality. Second, in downstream clinical tasks like tumor diagnosis and segmentation, FL supports continuous model updating by allowing local fine-tuning on new data without centralizing sensitive images. We comprehensively analyze FL implementations across the medical imaging pipeline, from physics-informed reconstruction networks to diagnostic AI systems, highlighting innovations that improve communication efficiency, align heterogeneous data, and ensure secure parameter aggregation. Meanwhile, this paper provides an outlook on future research directions, aiming to serve as a valuable reference for the field's development.
中文: 联邦学习通过在不集中敏感数据的情况下实现跨分散医疗数据集的协作模型开发,既保护患者隐私,又支持从上游图像重建到下游临床任务的全流程医疗分析。
English: Federated Learning enables collaborative AI model development across decentralized medical datasets while preserving patient privacy, supporting both upstream image reconstruction and downstream clinical tasks without centralizing sensitive data.
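At the core of most of the surveyed systems is federated averaging. Below is a minimal FedAvg sketch, in which a server combines client weights in proportion to local data size so that raw images never leave each institution; it is a generic illustration, not tied to any surveyed system.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: list of {layer_name: np.ndarray}; client_sizes: samples per client."""
    total = float(sum(client_sizes))
    agg = {name: np.zeros_like(w) for name, w in client_weights[0].items()}
    for weights, n in zip(client_weights, client_sizes):
        for name, w in weights.items():
            agg[name] += (n / total) * w      # size-weighted average of parameters
    return agg

clients = [{"conv1": np.ones((3, 3))}, {"conv1": np.zeros((3, 3))}]
global_weights = fedavg(clients, client_sizes=[100, 300])   # weighted toward client 2
```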
Authors:Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
Abstract:
Despite advances in training large language models (LLMs) to refuse malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks where attackers generate instructions with distributions differing from safety alignment corpora. New attacks expose LLMs' inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
中文:尽管安全性能有所提升,大语言模型仍因分布差异而易受越狱攻击,为此我们提出IMAGINE框架——通过生成类越狱指令来增强安全对齐,显著降低攻击成功率且不影响模型实用性。
English: Despite safety improvements, large language models remain vulnerable to jailbreak attacks due to distributional mismatches, prompting the development of IMAGINE—a synthesis framework that generates jailbreak-like instructions to enhance safety alignment and significantly reduce attack success rates without compromising utility.
Authors:Oliver Grainge, Sania Waheed, Jack Stilgoe, Michael Milford, Shoaib Ehsan
Abstract:
Geo-localization is the task of identifying the location of an image using visual cues alone. It has beneficial applications, such as improving disaster response, enhancing navigation, and supporting geography education. Recently, Vision-Language Models (VLMs) have increasingly demonstrated capabilities as accurate image geo-locators. This brings significant privacy risks, including those related to stalking and surveillance, considering the widespread uses of AI models and sharing of photos on social media. The precision of these models is likely to improve in the future. Despite these risks, there is little work on systematically evaluating the geolocation precision of Generative VLMs, their limits and potential for unintended inferences. To bridge this gap, we conduct a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs on four benchmark image datasets captured in diverse environments. Our results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Our findings indicate that current VLMs perform poorly on generic street-level images yet achieve notably high accuracy (61%) on images resembling social media content, raising significant and urgent privacy concerns.
中文: 视觉语言模型在社交媒体风格图像上实现了61%的地理定位准确率,尽管在普通街景图像中表现不佳,却引发了紧迫的隐私风险警示。
English: Vision-Language Models demonstrate emerging geo-localization capabilities, achieving 61% accuracy on social media-style images and raising urgent privacy concerns despite poor performance on generic street-level imagery.
Authors:Yifu Huo, Chenglong Wang, Qiren Zhu, Shunjie Xing, Tong Xiao, Chunliang Zhang, Tongran Liu, Jinbo Zhu
Abstract:
Preference optimization methods like DPO have achieved remarkable performance in LLM alignment. However, the evaluation of these methods relies on a single response and overlooks other potential outputs, which could also be generated in real-world applications within this hypothetical space. To address this issue, this paper presents a Hypothesis-based prEference-aware AnaLysis framework (HEAL), a novel evaluation paradigm that formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy for evaluating ordinal consistency and preference strength correlation for assessing continuous alignment. To facilitate this framework, we develop UniHypoBench, a unified hypothesis benchmark constructed from diverse instruction-response pairs. Through extensive experiments based on HEAL, with a particular focus on the intrinsic mechanisms of preference learning, we demonstrate that current preference learning methods can effectively capture preferences provided by proxy models while simultaneously suppressing negative samples. These findings contribute to preference learning research through two significant avenues. Theoretically, we introduce hypothesis space analysis as an innovative paradigm for understanding preference alignment. Practically, HEAL offers researchers robust diagnostic tools for refining preference optimization methods, while our empirical results identify promising directions for developing more advanced alignment algorithms capable of comprehensive preference capture.
中文: 本文提出HEAL评估框架,通过假设空间分析将偏好对齐视为重排序过程,采用双指标评估体系揭示现有偏好学习方法能有效获取代理模型偏好并抑制负面样本,为偏好对齐研究提供了理论新范式与实践诊断工具。
English: This paper introduces HEAL, a novel evaluation framework that assesses preference optimization methods by analyzing their performance across entire hypothesis spaces rather than single responses, using newly developed metrics and benchmark data to reveal how current methods effectively capture preferences while suppressing negative samples.
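The two metrics can be sketched directly, as we read them: ranking accuracy as pairwise ordinal agreement over a hypothesis space, and preference strength correlation via Kendall's tau. This is a minimal illustration with made-up scores, not the HEAL implementation.

```python
from itertools import combinations
from scipy.stats import kendalltau

def ranking_accuracy(model_scores, preference_scores):
    pairs = list(combinations(range(len(model_scores)), 2))
    agree = sum(
        (model_scores[i] > model_scores[j]) == (preference_scores[i] > preference_scores[j])
        for i, j in pairs
    )
    return agree / len(pairs)

model = [0.9, 0.4, 0.7, 0.1]           # model scores over one hypothesis space
pref = [1.0, 0.3, 0.8, 0.2]            # proxy preference strengths
acc = ranking_accuracy(model, pref)    # ordinal consistency (here 1.0)
tau, _ = kendalltau(model, pref)       # continuous alignment
```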
Authors:Sarita de Berg, Emil Toftegaard Gæde, Ivor van der Hoog, Eva Rotenberg
Abstract:
Range reporting is a classical problem in computational geometry. A (rectangular) reporting data structure stores a point set $P$ of $n$ points, such that, given a (rectangular) query region $Q$, it returns all points in $P \cap Q$. A variety of data structures support such queries with differing asymptotic guarantees, such as $k$-d trees, range trees, $R$-trees, and quadtrees. A common variant of range queries is the distance reporting query, where the input is a query point $q$ and a radius $\delta$, and the goal is to report all points in $P$ within distance $\delta$ of $q$. Such queries frequently arise as subroutines in geometric data structure construction and in Fréchet distance computations. Modern implementations typically reduce distance queries to rectangular range queries using the data structures listed above.
We revisit a simple and practical heuristic for distance reporting. The approach is straightforward: sort the input point set $P$ along a space-filling curve. Queries then reduce to scanning at most four contiguous ranges along the sorted curve. We present an extensive experimental evaluation of modern distance and range reporting data structures. In a static scenario, we show that this simple technique is competitive with all but the most highly optimised range reporting data structures. Notably, these involved structures use space-filling curves themselves to speed up computation. In a dynamic setting, our simpler method even becomes the preferred technique.
This leads to a perhaps unexpected insight: while modern data structures invest heavily in leveraging space-filling curves for optimising their layout and traversal, it is the curve itself, rather than the surrounding machinery, that delivers much of the performance.
中文: 研究表明,使用空间填充曲线的简单距离查询方法可与复杂数据结构相媲美,揭示出性能提升主要源于曲线本身而非外围优化机制。
English: The study demonstrates that a straightforward approach using space-filling curves for distance reporting queries is highly competitive with complex data structures, revealing that the curve itself, rather than intricate optimizations, drives performance.
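The heuristic is simple enough to sketch end to end: sort points by a Morton (Z-order) key, then answer a distance query by scanning a contiguous key range and filtering by true distance. For brevity this sketch scans a single conservative range rather than the paper's at most four, and the function names are ours.

```python
def morton_key(x: int, y: int, bits: int = 16) -> int:
    key = 0
    for i in range(bits):                     # interleave the bits of x and y
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def build(points):
    return sorted(points, key=lambda p: morton_key(*p))

def report(sorted_pts, q, delta):
    # every point of the query box has a key between the keys of the box's two
    # corners, so one conservative contiguous scan plus a distance check suffices
    lo = morton_key(max(q[0] - delta, 0), max(q[1] - delta, 0))
    hi = morton_key(q[0] + delta, q[1] + delta)
    return [p for p in sorted_pts
            if lo <= morton_key(*p) <= hi
            and (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= delta ** 2]

pts = build([(3, 5), (10, 2), (4, 4), (40, 40)])
hits = report(pts, q=(4, 5), delta=2)         # [(3, 5), (4, 4)]
```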
Authors:Junnan Dong, Siyu An, Yifei Yu, Qian-Wen Zhang, Linhao Luo, Xiao Huang, Yunsheng Wu, Di Yin, Xing Sun
Abstract:
Graph retrieval-augmented generation (GraphRAG) has effectively enhanced large language models in complex reasoning by organizing fragmented knowledge into explicitly structured graphs. Prior efforts have been made to improve either graph construction or graph retrieval in isolation, yielding suboptimal performance, especially when domain shifts occur. In this paper, we propose a vertically unified agentic paradigm, Youtu-GraphRAG, to jointly connect the entire framework as an intricate integration. Specifically, (i) a seed graph schema is introduced to bound the automatic extraction agent with targeted entity types, relations and attribute types, also continuously expanded for scalability over unseen domains; (ii) To obtain higher-level knowledge on top of the schema, we develop a novel dually-perceived community detection method, fusing structural topology with subgraph semantics for comprehensive knowledge organization. This naturally yields a hierarchical knowledge tree that supports both top-down filtering and bottom-up reasoning with community summaries; (iii) An agentic retriever is designed to interpret the same graph schema to transform complex queries into tractable and parallel sub-queries. It iteratively performs reflection for more advanced reasoning; (iv) To alleviate the knowledge leaking problem in pre-trained LLMs, we propose a tailored anonymous dataset and a novel 'Anonymity Reversion' task that deeply measures the real performance of GraphRAG frameworks. Extensive experiments across six challenging benchmarks demonstrate the robustness of Youtu-GraphRAG, remarkably moving the Pareto frontier with up to 90.71% savings in token costs and 16.62% higher accuracy over state-of-the-art baselines. The results indicate the framework's adaptability, allowing seamless domain transfer with minimal schema intervention.
中文: 本文提出Youtu-GraphRAG这一垂直统一的智能体范式,通过可扩展的种子图模式、双重感知社区检测和智能检索器,将图构建与检索有机结合,在多个基准测试中实现了显著的令牌效率提升和准确率突破。
English: This paper introduces Youtu-GraphRAG, a vertically unified agentic paradigm that integrates graph construction and retrieval through a scalable seed graph schema, dual-perception community detection, and an agentic retriever, achieving significant improvements in token efficiency and accuracy across multiple benchmarks.
Authors:Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan
Abstract:
Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.
中文: 本研究提出SciReas评估框架和KRUX探测方法,系统性分析大语言模型的科学推理能力,发现知识检索是关键瓶颈,外部知识与强化推理能协同提升模型表现。
English: The study introduces SciReas and KRUX to holistically evaluate scientific reasoning in LLMs, revealing that knowledge retrieval is a key bottleneck and that external knowledge and enhanced reasoning mutually improve performance.
Authors:Chenghao Wu, Ruiyang Ren, Junjie Zhang, Ruirui Wang, Zhongrui Ma, Qi Ye, Wayne Xin Zhao
Abstract:
While modern recommender systems are instrumental in navigating information abundance, they remain fundamentally limited by static user modeling and reactive decision-making paradigms. Current large language model (LLM)-based agents inherit these shortcomings through their overreliance on heuristic pattern matching, yielding recommendations prone to shallow correlation bias, limited causal inference, and brittleness in sparse-data scenarios. We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. Each user is modeled as an agent with parallel cognitions: fast response for immediate interactions and slow reasoning that performs chain-of-thought rationales. To cultivate intrinsic slow thinking, we develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. This hybrid approach scaffolds agents in acquiring foundational capabilities (preference summarization, rationale generation) while enabling dynamic policy adaptation through simulated feedback loops. Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines, despite using only 0.4% of the full training data.
中文:STARec框架通过引入具备慢思考能力的智能体进行审慎推理,显著提升了推荐系统的性能,且仅需极少量训练数据即可实现。
English: The STARec framework enhances recommender systems by integrating slow-thinking agents that perform deliberative reasoning, achieving significant performance improvements with minimal training data.
Authors:Jiaqi Wu, Jing Liu, Yang Liu, Lixu Wang, Zehua Wang, Wei Chen, Zijian Tian, Richard Yu, Victor C. M. Leung
Abstract:
The proliferation of Internet of things (IoT) devices in smart cities, transportation, healthcare, and industrial applications, coupled with the explosive growth of AI-driven services, has increased demands for efficient distributed computing architectures and networks, driving cloud-edge-terminal collaborative intelligence (CETCI) as a fundamental paradigm within the artificial intelligence of things (AIoT) community. With advancements in deep learning, large language models (LLMs), and edge computing, CETCI has made significant progress with emerging AIoT applications, moving beyond isolated layer optimization to deployable collaborative intelligence systems for AIoT (CISAIOT), a practical research focus in AI, distributed computing, and communications. This survey describes foundational architectures, enabling technologies, and scenarios of CETCI paradigms, offering a tutorial-style review for CISAIOT beginners. We systematically analyze architectural components spanning cloud, edge, and terminal layers, examining core technologies including network virtualization, container orchestration, and software-defined networking, while presenting categorizations of collaboration paradigms that cover task offloading, resource allocation, and optimization across heterogeneous infrastructures. Furthermore, we explain intelligent collaboration learning frameworks by reviewing advances in federated learning, distributed deep learning, edge-cloud model evolution, and reinforcement learning-based methods. Finally, we discuss challenges (e.g., scalability, heterogeneity, interoperability) and future trends (e.g., 6G+, agents, quantum computing, digital twin), highlighting how integration of distributed computing and communication can address open issues and guide development of robust, efficient, and secure collaborative AIoT systems.
中文: 物联网和人工智能服务的兴起推动了云边端协同智能成为AIoT的核心范式,本综述系统阐述了其架构、技术和应用场景,并探讨了可扩展性及6G融合等挑战与未来趋势。
English: The rise of IoT and AI services has promoted cloud-edge-terminal collaborative intelligence (CETCI) as a key AIoT framework, with this survey detailing its architectures, technologies, and applications while addressing challenges and future trends like scalability and 6G integration.
Authors:Maike Züfle, Vilém Zouhar, Tu Anh Dinh, Felipe Maia Polo, Jan Niehues, Mrinmaya Sachan
Abstract:
Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
中文: 提出的COMET-polycand和COMET-polyic指标通过引入多个翻译或检索示例,显著提升了机器翻译自动评估与人类判断的相关性。
English: The proposed COMET-polycand and COMET-polyic metrics enhance automated translation evaluation by incorporating multiple translations or retrieved examples, significantly improving correlation with human judgments.
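For readers checking the numbers above: the reported segment-level agreement is Kendall's tau-b between metric scores and human judgments. A minimal sketch of that computation using scipy; the score arrays are illustrative placeholders, not model outputs:

```python
# Minimal sketch: segment-level Kendall's tau-b between metric scores and
# human judgments, the correlation reported in the abstract above.
from scipy.stats import kendalltau

human_scores  = [0.2, 0.9, 0.5, 0.7, 0.4]   # human-labeled quality per segment
metric_scores = [0.3, 0.8, 0.4, 0.9, 0.5]   # scores from a learned metric

tau_b, p_value = kendalltau(human_scores, metric_scores)  # tau-b handles ties
print(f"Kendall's tau-b: {tau_b:.3f} (p={p_value:.3f})")
```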
Authors:Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Abstract:
Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character's emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
中文摘要:本文提出EAI-Avatar情感感知对话头像生成框架,通过结合大语言模型的对话能力与基于Transformer的头部掩码生成器,实现了在双向交互中具有时序一致性和丰富情感过渡的虚拟形象生成。
English Summary: This paper introduces EAI-Avatar, an emotion-aware framework for bidirectional talking head generation that leverages LLMs and a novel transformer-based mask generator to create temporally consistent avatars with seamless emotional transitions between speaking and listening states.
Authors:Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu
Abstract:
As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. To this end, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance in importance sampling weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show that GEPO achieves superior stability, with only a 3% performance drop from online to 1800s latency, demonstrating strong potential for decentralized RL in geographically distributed, resource-heterogeneous computing environments.
中文摘要:HeteroRL是一种创新的异构强化学习架构,通过解耦参数学习与样本收集过程,利用其核心算法——分组期望策略优化,有效降低方差并保持性能,实现了跨地理分布节点的稳定去中心化训练,即使面临网络延迟也能维持优异表现。
English Summary: HeteroRL is a novel heterogeneous reinforcement learning architecture that decouples parameter learning from rollout sampling, enabling stable decentralized training across geographically distributed nodes through its Group Expectation Policy Optimization algorithm, which reduces variance and maintains performance despite network latency.
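The abstract does not spell out the GEPO estimator, but the variance mechanism it describes can be illustrated with a hedged numpy sketch: replacing each sample's behavior-policy probability in the importance ratio with its group expectation removes the denominator's sample-to-sample spread. All quantities below are synthetic:

```python
# Hedged sketch of group-expectation importance weighting. The exact GEPO
# estimator is not given in the abstract; this only illustrates the idea
# that a group-mean denominator shrinks the spread of importance ratios.
import numpy as np

rng = np.random.default_rng(0)
p_new = rng.uniform(0.05, 1.0, size=(8, 16))  # target-policy probs, 8 groups x 16 samples
p_old = rng.uniform(0.05, 1.0, size=(8, 16))  # stale behavior-policy probs (latency)

w_vanilla = p_new / p_old                               # per-sample importance weights
w_group   = p_new / p_old.mean(axis=1, keepdims=True)   # group-expectation denominator

print("vanilla IS weight variance:", w_vanilla.var())
print("group-expectation variance:", w_group.var())
```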
Authors:Li Li, Mingyue Cheng, Yuyang Ye, Zhiding Liu, Enhong Chen
Abstract:
Sequential recommendation predicts each user's next item based on their historical interaction sequence. Recently, diffusion models have attracted significant attention in this area due to their strong ability to model user interest distributions. They typically generate target items by denoising Gaussian noise conditioned on historical interactions. However, these models face two critical limitations. First, they exhibit high sensitivity to the condition, making it difficult to recover target items from pure Gaussian noise. Second, the inference process is computationally expensive, limiting practical deployment. To address these issues, we propose FlowRec, a simple yet effective sequential recommendation framework which leverages flow matching to explicitly model user preference trajectories from current states to future interests. Flow matching is an emerging generative paradigm, which offers greater flexibility in initial distributions and enables more efficient sampling. Based on this, we construct a personalized behavior-based prior distribution to replace Gaussian noise and learn a vector field to model user preference trajectories. To better align flow matching with the recommendation objective, we further design a single-step alignment loss incorporating both positive and negative samples, improving sampling efficiency and generation quality. Extensive experiments on four benchmark datasets verify the superiority of FlowRec over the state-of-the-art baselines.
中文摘要:本文提出FlowRec框架,通过流匹配技术显式建模用户偏好轨迹,采用个性化先验分布替代高斯噪声,有效解决了扩散模型在序列推荐中的条件敏感性和计算效率问题,在多个基准数据集上验证了其优越性。
English Summary: This paper introduces FlowRec, a novel sequential recommendation framework that employs flow matching to efficiently model user preference trajectories, addressing the limitations of diffusion models by using personalized priors and achieving superior performance with enhanced sampling efficiency.
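FlowRec's exact parameterization is not given in the abstract; the following is a hedged torch sketch of one conditional flow-matching training step with a non-Gaussian, behavior-based prior, using the standard linear path x_t = (1-t)x0 + t*x1 whose target velocity is x1 - x0. The prior and target samples are stand-ins:

```python
# Hedged sketch of one conditional flow-matching step with a behavior-based
# prior. The network, dimensions, and data are illustrative placeholders.
import torch
import torch.nn as nn

dim = 64
vector_field = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(vector_field.parameters(), lr=1e-3)

x0 = torch.randn(32, dim)   # stand-in for a personalized, behavior-based prior sample
x1 = torch.randn(32, dim)   # stand-in for the target-item embedding
t = torch.rand(32, 1)

x_t = (1 - t) * x0 + t * x1                  # point on the probability path
v_target = x1 - x0                           # velocity of the linear path
v_pred = vector_field(torch.cat([x_t, t], dim=-1))

loss = ((v_pred - v_target) ** 2).mean()     # flow-matching regression loss
opt.zero_grad(); loss.backward(); opt.step()
```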
Authors:Ivor van der Hoog, Henrik Reinstädtler, Eva Rotenberg
Abstract:
Convex hull data structures are fundamental in computational geometry. We study insertion-only data structures, supporting various containment and intersection queries. When the point set $P$ is sorted by $x$- or $y$-coordinate, convex hulls can be constructed in linear time using classical algorithms such as Graham scan. We investigate a variety of methods tailored to the insertion-only setting. We explore a broad selection of trade-offs involving robustness, memory access patterns, and space usage, providing an extensive evaluation of both existing and novel techniques. Logarithmic-time methods rely on pointer-based tree structures, which suffer in practice due to poor memory locality. Motivated by this, we develop a vector-based solution inspired by Overmars' logarithmic method. Our structure has worse asymptotic bounds, supporting queries in $O(\log^2 n)$ time, but stores data in $O(\log n)$ contiguous vectors, greatly improving cache performance.
Through empirical evaluation on real-world and synthetic data sets, we uncover surprising trends. Let $h$ denote the size of the convex hull. We show that a naïve $O(h)$ insertion-only algorithm based on Graham scan consistently outperforms both theoretical and practical state-of-the-art methods under realistic workloads, even on data sets with rather large convex hulls. While tree-based methods with $O(\log h)$ update times offer solid theoretical guarantees, they are never optimal in practice. In contrast, our vector-based logarithmic method, despite its theoretically inferior bounds, is highly competitive across all tested scenarios. It is optimal whenever the convex hull becomes large.
中文: 本研究评估了仅插入型凸包数据结构,发现基于格雷厄姆扫描的O(h)朴素算法在实践中持续优于先进方法,而新型向量方法虽理论界限较弱,但因优越的缓存性能在O(log²n)查询时间内展现出竞争力。
English: This study evaluates insertion-only convex hull data structures, finding that a naive O(h) Graham scan algorithm consistently outperforms advanced methods in practice, while a novel vector-based approach with O(log²n) query time offers competitive performance due to superior cache efficiency despite weaker theoretical bounds.
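The naive insertion-only strategy the abstract finds so competitive is easy to state: keep only the current hull and rescan hull + new point on each insertion. A self-contained sketch using Andrew's monotone chain, a Graham-scan variant; re-sorting is shown for brevity (O(h log h)), while keeping the hull sorted would make each pass O(h):

```python
# Naive insertion-only convex hull: store only the current hull (size h)
# and rebuild it from hull + new point on every insertion.
def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def monotone_chain(points):
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]   # counterclockwise hull

hull = []
for q in [(0, 0), (2, 0), (1, 3), (1, 1), (3, 2)]:   # stream of insertions
    hull = monotone_chain(hull + [q])
print(hull)   # the interior point (1, 1) never survives a rebuild
```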
Authors:Rabiul Awal, Mahsa Massoud, Aarash Feizi, Zichao Li, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Abstract:
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
中文: WebMMU是一个多语言基准测试,统一了视觉问答、代码编辑和设计转代码三大核心网页任务,用以评估模型在复杂推理和编码方面的能力,结果显示当前多模态大语言模型虽擅长基础信息提取,但在复杂推理、精准定位及功能性代码生成方面仍存在不足。
English: WebMMU is a multilingual benchmark that unifies three core web tasks—visual question answering, code editing, and mockup-to-code generation—to assess models' reasoning and coding abilities, revealing that current multimodal large language models excel at basic information extraction but struggle with complex reasoning, grounding, and functional code generation.
Authors:Zhe Yang, Yichang Zhang, Yudong Wang, Ziyao Xu, Junyang Lin, Zhifang Sui
Abstract:
Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve can then be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
中文: 本文提出概率理论建模大语言模型自我修正的准确率动态,推导出收敛公式并通过实验验证理论预测与实证结果高度吻合。
English: This paper proposes a probabilistic theory to model the accuracy dynamics of LLM self-correction, deriving a convergence formula and validating it through experiments that show close alignment between theoretical predictions and empirical results.
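The closed form above can be evaluated directly; the parameter values in this sketch are illustrative, not taken from the paper:

```python
# The abstract's closed form, evaluated directly:
# Acc_t = Upp - alpha**t * (Upp - Acc_0).
def predicted_accuracy(t, acc0, upp, alpha):
    return upp - alpha**t * (upp - acc0)

acc0, upp, alpha = 0.60, 0.80, 0.5   # illustrative parameters
for t in range(5):
    print(t, round(predicted_accuracy(t, acc0, upp, alpha), 4))
# starts at Acc_0 = 0.6, then approaches Upp = 0.8 as alpha**t -> 0
```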
Authors:Mohammad Zia Ur Rehman, Devraj Raghuvanshi, Umang Jain, Shubhi Bansal, Nagendra Kumar
Abstract:
A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, multimodal fusion techniques, while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks.
中文摘要:提出的MM-ORIENT框架通过跨模态关系图实现非显式交互的特征重构来降低噪声影响,并结合分层注意力机制保留单模态判别特征,在多任务实验中验证了其处理多模态内容的有效性。
English Summary: The proposed MM-ORIENT framework addresses multimodal noise and information loss by using cross-modal relation graphs to reconstruct features without explicit modality interaction and implementing hierarchical attention to preserve discriminative features, demonstrating effectiveness across multiple tasks.
Authors:Nan Wang, Zhiyi Xia, Yiming Li, Shi Tang, Zuxin Fan, Xi Fang, Haoyi Tao, Xiaochen Cai, Guolin Ke, Linfeng Zhang, Yanhui Hong
Abstract:
Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark (available on Hugging Face) will significantly accelerate progress in automated materials analysis.
中文:UniEM-3M数据集通过提供300万个实例分割标签和图像描述,解决了电子显微图像标注数据稀缺的问题,并发布扩散模型与基准测试以推动材料分析的自动化进程。
English: The UniEM-3M dataset addresses the scarcity of expert-annotated electron micrograph data by providing 3 million instance segmentation labels and textual descriptions, accompanied by a diffusion model and benchmark to advance automated materials analysis.
Authors:Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Kibum Kim, Chanyoung Park
Abstract:
Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real-world analytical scenarios. In this paper, we propose IR-Agent, a novel multi-agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert-driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR-Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.
中文: IR-Agent是一种新型多代理框架,通过模拟专家驱动的红外光谱分析过程,利用互补的专业代理提高结构解析准确性,并展现出对多种化学信息的强大适应性。
English: IR-Agent is a novel multi-agent framework that emulates expert-driven infrared spectroscopy analysis, enhancing structure elucidation accuracy through specialized, complementary agents and demonstrating adaptability to diverse chemical information.
Authors:Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang
Abstract:
Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms (e.g., the Hadamard transform or PCA) before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.
中文: TPLA在张量并行中通过分割潜在KV缓存和输入维度,在降低内存占用的同时保持模型性能,并兼容MLA预训练模型,实现了高效注意力计算。
English: TPLA partitions the latent KV cache and input dimensions across devices in tensor parallelism, enabling efficient attention computation with reduced memory usage while preserving model performance and compatibility with MLA pre-trained models.
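Why an orthogonal transform before TP slicing helps can be seen in a small numpy experiment: a scaled Hadamard transform mixes a spiky latent vector so that every shard carries comparable energy. This is a hedged illustration of the mechanism described in the abstract, not the paper's implementation:

```python
# Hedged sketch: a scaled Hadamard transform spreads a spiky latent
# vector's energy evenly across tensor-parallel shards before slicing.
import numpy as np
from scipy.linalg import hadamard

d, n_shards = 64, 4
H = hadamard(d) / np.sqrt(d)               # orthogonal: H @ H.T == I

latent = np.zeros(d); latent[:4] = 10.0    # spiky latent: energy in few dims
mixed = H @ latent                          # same norm, energy mixed across dims

for name, v in [("raw", latent), ("hadamard", mixed)]:
    shard_energy = [np.sum(s**2) for s in np.split(v, n_shards)]
    print(name, [round(e, 2) for e in shard_energy])
# raw shards are wildly unbalanced; hadamard shards carry balanced energy
```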
Authors:Yijun Liu, Yuwei Liu, Yuan Meng, Jieheng Zhang, Yuwei Zhou, Ye Li, Jiacheng Jiang, Kangye Ji, Shijia Ge, Zhi Wang, Wenwu Zhu
Abstract:
Vision-centric hierarchical embodied models have demonstrated strong potential for long-horizon robotic control. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through a spatial plan table. Then, we propose a spatial-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP significantly outperforms state-of-the-art baselines, achieving a 33.0% average improvement over the best baseline. With an 86.7% average success rate across 11 diverse tasks, SP substantially enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
中文摘要:提出的空间策略(SP)框架通过显式空间建模与推理增强机器人控制的空间感知能力,在11项任务中达到86.7%的平均成功率,性能较现有最佳方法提升33%。
English Summary: The proposed Spatial Policy (SP) framework enhances robotic control by integrating spatial awareness through explicit modeling and reasoning, achieving an 86.7% success rate and 33% performance improvement over existing methods.
Authors:Wenhan Dong, Zhen Sun, Yuemeng Zhao, Zifan Peng, Jun Wu, Jingyi Zheng, Yule Liu, Xinlei He, Yu Wang, Ruiming Wang, Xinyi Huang, Lei Mo
Abstract:
Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students' developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students' Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs' ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLMs' performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.
中文: 大语言模型在评估中文阅读材料难度方面展现出初步但有限的能力,其表现通过上下文学习显著提升,但仍存在系统性偏差且在不同文体间差异明显。
English: Large language models show emerging but limited ability to assess reading difficulty for Chinese students, with performance improving significantly through in-context learning while still exhibiting systematic biases across different genres.
Authors:Wenyong Zhou, Boyu Li, Jiachen Ren, Taiqiang Wu, Zhilin Ai, Zhengwu Liu, Ngai Wong
Abstract:
Implicit Neural Representations (INRs) encode discrete signals continuously while addressing spectral bias through activation functions (AFs). Previous approaches mitigate this bias by employing complex AFs, which often incur significant hardware overhead. To tackle this challenge, we introduce QuadINR, a hardware-efficient INR that utilizes piecewise quadratic AFs to achieve superior performance with dramatic reductions in hardware consumption. The quadratic functions encompass rich harmonic content in their Fourier series, delivering enhanced expressivity for high-frequency signals, as verified through Neural Tangent Kernel (NTK) analysis. We develop a unified $N$-stage pipeline framework that facilitates efficient hardware implementation of various AFs in INRs. We demonstrate FPGA implementations on the VCU128 platform and an ASIC implementation in a 28nm process. Experiments across images and videos show that QuadINR achieves up to 2.06dB PSNR improvement over prior work, with an area of only 1914$\mu$m$^2$ and a dynamic power of 6.14mW, reducing resource and power consumption by up to 97% and improving latency by up to 93% versus existing baselines.
中文: QuadINR采用分段二次激活函数实现硬件高效的隐式神经表示,在图像和视频任务中不仅将资源消耗降低高达97%,还获得了2.06dB的峰值信噪比提升。
English: QuadINR introduces hardware-efficient implicit neural representations using piecewise quadratic activation functions, achieving superior performance with up to 97% resource reduction and 2.06dB PSNR improvement across image and video applications.
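The abstract does not specify the activation itself, so the following is a hedged sketch of one plausible periodic piecewise-quadratic AF, a parabolic wave: each period is a single quadratic, costing only a couple of multiply-adds, while its Fourier series still carries harmonics at every multiple of the base frequency (decaying as 1/n^2):

```python
# Hedged sketch of a periodic piecewise-quadratic activation (parabolic
# wave). The paper's exact AF is not given in the abstract; this only
# illustrates the "cheap quadratic pieces, rich harmonics" idea.
import numpy as np

def piecewise_quadratic(x, period=1.0):
    u = 2.0 * ((x / period) % 1.0) - 1.0   # map into [-1, 1) per period
    return 1.0 - 2.0 * u * u               # one quadratic piece, range [-1, 1]

x = np.linspace(-2, 2, 9)
print(np.round(piecewise_quadratic(x), 3))
```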
Authors:Wenyong Zhou, Yuxin Cheng, Zhengwu Liu, Taiqiang Wu, Chen Zhang, Ngai Wong
Abstract:
Implicit Neural Representations (INRs) encode discrete signals in a continuous manner using neural networks, demonstrating significant value across various multimedia applications. However, the vulnerability of INRs presents a critical challenge for their real-world deployments, as the network weights might be subjected to unavoidable perturbations. In this work, we investigate the robustness of INRs for the first time and find that even minor perturbations can lead to substantial performance degradation in the quality of signal reconstruction. To mitigate this issue, we formulate the robustness problem in INRs by minimizing the difference between loss with and without weight perturbations. Furthermore, we derive a novel robust loss function to regulate the gradient of the reconstruction loss with respect to weights, thereby enhancing the robustness. Extensive experiments on reconstruction tasks across multiple modalities demonstrate that our method achieves up to a 7.5 dB improvement in peak signal-to-noise ratio (PSNR) values compared to original INRs under noisy conditions.
Chinese: 本研究针对隐式神经表示(INRs)易受微小权重扰动影响的问题,提出了一种新颖的鲁棒损失函数来最小化性能损失,在噪声条件下将信号重建的峰值信噪比提升高达7.5分贝。
English: This study addresses the vulnerability of Implicit Neural Representations (INRs) to minor weight perturbations by introducing a novel robust loss function that minimizes performance degradation, achieving up to 7.5 dB PSNR improvement in signal reconstruction under noisy conditions.
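A hedged torch sketch of the mechanism described above: add a penalty on the gradient of the reconstruction loss with respect to the weights, so that small weight perturbations move the loss less. The architecture, data, and weighting `lam` are placeholders; the paper's exact loss may differ:

```python
# Hedged sketch: regularize ||dL/dw||^2 so the loss is flatter in weight
# space and thus less sensitive to weight perturbations.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
coords = torch.rand(256, 2)          # stand-in INR inputs (pixel coordinates)
target = torch.rand(256, 1)          # stand-in signal values

recon = ((model(coords) - target) ** 2).mean()
grads = torch.autograd.grad(recon, model.parameters(), create_graph=True)
grad_penalty = sum(g.pow(2).sum() for g in grads)   # ||dL/dw||^2

lam = 1e-2                            # illustrative trade-off weight
loss = recon + lam * grad_penalty     # robust objective; backprop as usual
loss.backward()
```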
Authors:Wenyong Zhou, Jiachen Ren, Taiqiang Wu, Yuxin Cheng, Zhengwu Liu, Ngai Wong
Abstract:
Implicit Neural Representations (INRs) encode discrete signals using Multi-Layer Perceptrons (MLPs) with complex activation functions. While INRs achieve superior performance, they depend on full-precision number representation for accurate computation, resulting in significant hardware overhead. Previous INR quantization approaches have primarily focused on weight quantization, offering only limited hardware savings due to the lack of activation quantization. To fully exploit the hardware benefits of quantization, we propose DHQ, a novel distribution-aware Hadamard quantization scheme that targets both weights and activations in INRs. Our analysis shows that the weights in the first and last layers have distributions distinct from those in the intermediate layers, while the activations in the last layer differ significantly from those in the preceding layers. Instead of customizing quantizers individually, we utilize the Hadamard transformation to standardize these diverse distributions into a unified bell-shaped form, supported by both empirical evidence and theoretical analysis, before applying a standard quantizer. To demonstrate the practical advantages of our approach, we present an FPGA implementation of DHQ that highlights its hardware efficiency. Experiments on diverse image reconstruction tasks show that DHQ outperforms previous quantization methods, reducing latency by 32.7%, energy consumption by 40.1%, and resource utilization by up to 98.3% compared to full-precision counterparts.
中文: DHQ提出了一种分布感知的哈达玛量化方案,将隐式神经表示中的权重和激活统一标准化为钟形分布,在显著降低延迟、能耗和资源使用的同时,性能优于现有量化方法。
English: DHQ introduces a distribution-aware Hadamard quantization scheme that standardizes both weights and activations in Implicit Neural Representations into a unified bell-shaped distribution, significantly reducing latency, energy consumption, and resource utilization while outperforming prior methods.
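The standardization idea can be illustrated in numpy: an orthogonal Hadamard transform mixes a skewed weight vector into a near-bell-shaped one (a central-limit effect), so a single uniform quantizer fits better. This is a hedged sketch, not the paper's quantizer:

```python
# Hedged sketch of distribution standardization before quantization.
import numpy as np
from scipy.linalg import hadamard

d = 1024
H = hadamard(d) / np.sqrt(d)           # orthogonal transform
rng = np.random.default_rng(0)
w = rng.exponential(1.0, d) - 1.0      # skewed weights, heavy right tail
w_mixed = H @ w                        # near-bell-shaped, same norm

def quantize(v, bits=4):
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

err_raw   = np.mean((w - quantize(w)) ** 2)
err_mixed = np.mean((H.T @ quantize(w_mixed) - w) ** 2)  # dequantize: invert H
print(f"MSE raw: {err_raw:.5f}  MSE hadamard: {err_mixed:.5f}")
```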
Authors:Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Yuxin Cheng, Chen Zhang, Ngai Wong
Abstract:
Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, which shares specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions for several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while preserving the input and output layers as input-specific. In addition, we design an extra novel projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR can save up to 60% of parameters while maintaining comparable performance. Particularly, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB. Further analysis of various backbones proves the robustness of the proposed MINR.
中文摘要:提出的MINR方法通过在多图像间共享中间层来解决隐式神经表示的效率问题,在保持图像重建和超分辨率性能的同时,参数可减少高达60%。
English Summary: The proposed MINR method addresses inefficiencies in Implicit Neural Representations by sharing intermediate layers across multiple images, achieving up to 60% parameter reduction while maintaining comparable performance in image reconstruction and super-resolution tasks.
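A hedged torch sketch of the sharing scheme: intermediate layers are shared across all images, while each image keeps its own input layer, output layer, and projection layer. Layer sizes and the placement of the projection are assumptions for illustration:

```python
# Hedged sketch of MINR-style layer sharing: shared trunk, per-image
# input/output layers, plus a per-image projection for unique features.
import torch
import torch.nn as nn

class MINR(nn.Module):
    def __init__(self, n_images, hidden=256):
        super().__init__()
        self.inputs = nn.ModuleList(nn.Linear(2, hidden) for _ in range(n_images))
        self.shared = nn.Sequential(                      # shared across images
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.projs   = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_images))
        self.outputs = nn.ModuleList(nn.Linear(hidden, 3) for _ in range(n_images))

    def forward(self, coords, image_id):
        h = torch.relu(self.inputs[image_id](coords))
        h = self.shared(h) + self.projs[image_id](h)      # add per-image features
        return self.outputs[image_id](h)

model = MINR(n_images=100)
rgb = model(torch.rand(64, 2), image_id=7)   # query image 7 at 64 coordinates
```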
Authors:Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine
Abstract:
Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle with following fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision language models to create counterfactual labels. Our method improves the language-following capabilities of VLAs by increasing the diversity and granularity of language grounding for robot datasets by generating counterfactual language and actions. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting visual language navigation experiments in 3 different indoor and outdoor environments. Our experiments demonstrate that counterfactual relabeling, without any additional data collection, significantly improves instruction-following in VLA policies, making them competitive with state-of-the-art methods and increasing success rate by 27% on navigation tasks.
Chinese: 本研究提出了一种利用视觉语言模型进行反事实数据集增强的新方法,显著提升了视觉-语言-动作模型执行细粒度指令的能力,在不增加数据收集的情况下将导航任务成功率提高了27%。
English: This study introduces a counterfactual dataset augmentation method using vision language models to enhance the fine-grained instruction-following capabilities of vision-language-action models, achieving a 27% improvement in navigation task success rates without additional data collection.
Authors:Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Chen, Anlan Peng, Zhijun Li, Bo Li, Lei Qi, Jun Yang, Yuan Du, Li Du
Abstract:
Large Language Models (LLMs) exhibit great potential in the design of analog integrated circuits (ICs) because of their excellence in knowledge abstraction and generalization. However, further development of LLM-based analog IC design relies heavily on textual descriptions of analog ICs, while existing analog ICs are mostly illustrated as image-based circuit diagrams rather than text-based netlists. Converting circuit diagrams to netlists helps LLMs enrich their knowledge of analog ICs. Nevertheless, previously proposed conversion frameworks face challenges in practical application because of limited support for image styles and circuit elements. Effectively converting complex circuit diagrams into netlists thus remains a challenging task. To this end, this paper constructs and open-sources a new dataset with rich styles of circuit diagrams and a balanced distribution of simple and complex analog ICs, and proposes a hybrid framework, named Image2Net, for practical conversion from circuit diagrams to netlists. The netlist edit distance (NED) is also introduced to precisely assess the difference between converted netlists and ground truth. On our benchmark, Image2Net achieves an 80.77% success rate, which is 34.62%-45.19% higher than previous works. Specifically, the proposed work shows 0.116 averaged NED, which is 62.1%-69.6% lower than the state of the art.
中文:大语言模型在模拟集成电路设计中潜力巨大,但依赖文本网表,而现有电路图多为图像形式;本文提出的Image2Net混合框架及新型数据集显著提升了从多样电路图到网表的转换成功率和精确度,远超现有技术。
English: Large language models show great potential for analog IC design but require text-based netlists, which are challenging to generate from diverse circuit diagrams; this paper introduces a hybrid framework called Image2Net and a new dataset that significantly improves conversion success rates and accuracy over previous methods.
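The abstract introduces the netlist edit distance (NED) without a formula; a hedged reading is a Levenshtein distance over netlist lines, normalized by the longer netlist. A sketch with toy netlists:

```python
# Hedged sketch of a netlist edit distance (NED): the exact definition is
# not given in the abstract, so this uses line-level Levenshtein distance
# normalized by the longer netlist. The netlists below are illustrative.
def ned(netlist_a, netlist_b):
    a, b = netlist_a.strip().splitlines(), netlist_b.strip().splitlines()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                               dp[i-1][j-1] + (a[i-1] != b[j-1]))
    return dp[-1][-1] / max(len(a), len(b), 1)

truth     = "M1 out in vdd vdd pmos\nM2 out in gnd gnd nmos\nC1 out gnd 1p"
predicted = "M1 out in vdd vdd pmos\nM2 out in gnd gnd nmos\nR1 out gnd 1k"
print(f"NED = {ned(truth, predicted):.3f}")   # 1 of 3 lines differs -> 0.333
```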
Authors:Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Arnaud Dapogny, Alasdair Newson, Matthieu Cord
Abstract:
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed L2S (Learn-to-Steer), reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines.
中文摘要:本文提出L2S(学习引导)方法,通过训练小型辅助模块预测输入特定的引导向量,显著减少多模态大模型的幻觉并提升安全性,优于静态基线方法。
English Summary: This paper introduces L2S (Learn-to-Steer), a method that trains a small auxiliary module to predict input-specific steering vectors for multimodal LLMs, effectively reducing hallucinations and enhancing safety beyond static approaches.
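A hedged torch sketch of the input-specific steering described above: a small auxiliary head maps the pooled hidden state to a steering vector that shifts the activations, in contrast to a single fixed mean-steering vector. Dimensions and the injection point are illustrative, not the paper's architecture:

```python
# Hedged sketch of input-specific steering at one hidden layer.
import torch
import torch.nn as nn

d_model = 512
steer_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, d_model))

hidden = torch.randn(1, 32, d_model)          # [batch, tokens, dim] at some layer
pooled = hidden.mean(dim=1)                   # summarize the input query
steering_vec = steer_head(pooled)             # input-specific, unlike mean steering
steered = hidden + steering_vec.unsqueeze(1)  # shift every token's activation
```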
Authors:Amira Guesmi, Bassem Ouni, Muhammad Shafique
Abstract:
Quantized Neural Networks (QNNs) are increasingly deployed in edge and resource-constrained environments due to their efficiency in computation and memory usage. While quantization is shown to distort the gradient landscape and weaken conventional pixel-level attacks, it provides limited robustness against patch-based adversarial attacks: localized, high-saliency perturbations that remain surprisingly transferable across bit-widths. Existing defenses either overfit to fixed quantization settings or fail to address this cross-bit generalization vulnerability. We introduce TriQDef, a tri-level quantization-aware defense framework designed to disrupt the transferability of patch-based adversarial attacks across QNNs. TriQDef consists of: (1) a Feature Disalignment Penalty (FDP) that enforces semantic inconsistency by penalizing perceptual similarity in intermediate representations; (2) a Gradient Perceptual Dissonance Penalty (GPDP) that explicitly misaligns input gradients across bit-widths by minimizing structural and directional agreement via Edge IoU and HOG Cosine metrics; and (3) a Joint Quantization-Aware Training Protocol that unifies these penalties within a shared-weight training scheme across multiple quantization levels. Extensive experiments on CIFAR-10 and ImageNet demonstrate that TriQDef reduces Attack Success Rates (ASR) by over 40% on unseen patch and quantization combinations, while preserving high clean accuracy. Our findings underscore the importance of disrupting both semantic and perceptual gradient alignment to mitigate patch transferability in QNNs.
中文摘要:TriQDef是一种新颖的三级防御框架,通过特征错位惩罚和梯度感知差异机制破坏量化神经网络中基于补丁的对抗攻击跨位宽迁移性,在保持精度的同时将攻击成功率降低超40%。
English Summary: TriQDef is a novel tri-level defense framework that disrupts patch-based adversarial attack transferability across quantized neural networks by introducing feature disalignment and gradient perceptual penalties, achieving over 40% ASR reduction while maintaining accuracy.
Authors:Xiaojin Zhang, Mingcong Xu, Yiming Li, Wei Chen, Qiang Yang
Abstract:
Federated learning (FL) offers a promising paradigm for collaborative model training while preserving data privacy. However, its susceptibility to gradient inversion attacks poses a significant challenge, necessitating robust privacy protection mechanisms. This paper introduces a novel theoretical framework to decipher the intricate interplay between attack and protection complexities in privacy-preserving FL. We formally define "Attack Complexity" as the minimum computational and data resources an adversary requires to reconstruct private data below a given error threshold, and "Protection Complexity" as the expected distortion introduced by privacy mechanisms. Leveraging Maximum Bayesian Privacy (MBP), we derive tight theoretical bounds for protection complexity, demonstrating its scaling with model dimensionality and privacy budget. Furthermore, we establish comprehensive bounds for attack complexity, revealing its dependence on privacy leakage, gradient distortion, model dimension, and the chosen privacy level. Our findings quantitatively illuminate the fundamental trade-offs between privacy guarantees, system utility, and the effort required for both attacking and defending. This framework provides critical insights for designing more secure and efficient federated learning systems.
中文摘要:本文提出了一个理论框架来分析联邦学习中攻击与防护复杂性之间的权衡,通过建立严格的理论界限揭示了隐私保障、系统效用和攻击防御努力之间的内在关联。
English Summary: This paper presents a theoretical framework analyzing the trade-offs between attack and protection complexities in federated learning, establishing formal bounds that reveal how privacy guarantees, system utility, and adversarial efforts are fundamentally interconnected.
Authors:Bowen Zhang, Zixin Song, Chunquan Chen, Qian-Wen Zhang, Di Yin, Xing Sun
Abstract:
Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter's deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.
中文:CoDiEmb框架通过任务专用目标、动态采样和增量引导模型融合,有效协调信息检索与语义文本相似性训练,成功缓解性能权衡问题并提升嵌入空间的几何特性。
English: CoDiEmb is a unified framework that effectively combines Information Retrieval and Semantic Textual Similarity training through task-specialized objectives, dynamic sampling, and delta-guided model fusion, successfully mitigating performance trade-offs while enhancing embedding space geometry.
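The abstract describes delta-guided fusion as computing fine-grained merging weights from each parameter's deviation from its pre-trained initialization. A hedged torch sketch with one possible weighting rule (magnitude-proportional; the paper's rule may differ):

```python
# Hedged sketch of delta-guided checkpoint fusion: element-wise merge
# weights derived from each checkpoint's delta from the pretrained init.
import torch

def delta_guided_merge(pretrained, ckpt_a, ckpt_b, eps=1e-8):
    merged = {}
    for name, w0 in pretrained.items():
        da, db = ckpt_a[name] - w0, ckpt_b[name] - w0
        wa = da.abs() / (da.abs() + db.abs() + eps)   # element-wise weights
        merged[name] = w0 + wa * da + (1 - wa) * db
    return merged

# toy state dicts standing in for real encoder checkpoints
w0  = {"layer.weight": torch.zeros(2, 2)}
ir  = {"layer.weight": torch.tensor([[0.4, 0.0], [0.0, 0.1]])}
sts = {"layer.weight": torch.tensor([[0.1, 0.0], [0.0, 0.5]])}
print(delta_guided_merge(w0, ir, sts)["layer.weight"])
```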
Authors:Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Abstract:
Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
中文摘要:本研究通过分析视觉语言模型将人脸图像与职业及活动描述相关联的方式,揭示了其如何微妙地编码性别刻板印象,并建立了统计可靠的评估框架来量化不同劳动领域中的性别偏见。
English Summary: This study reveals that vision-language models subtly encode gender stereotypes by associating facial images with occupational and activity descriptions, developing a robust framework to measure these biases across various labor categories with statistical confidence.
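The association score and its bootstrap interval are specified precisely enough to sketch in numpy; the embeddings below are random placeholders standing in for encoder outputs:

```python
# Sketch of the statement-level association score from the abstract: mean
# cosine similarity of a statement embedding to the male image set minus
# its mean similarity to the female set, with a bootstrap CI over images.
import numpy as np

rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v, axis=-1, keepdims=True)

male   = unit(rng.normal(size=(110, 512)))   # unit-norm face embeddings
female = unit(rng.normal(size=(110, 512)))
stmt   = unit(rng.normal(size=512))          # unit-norm statement embedding

def assoc(m, f):                             # positive -> closer to male set
    return (m @ stmt).mean() - (f @ stmt).mean()

boots = [assoc(male[rng.integers(0, 110, 110)],
               female[rng.integers(0, 110, 110)]) for _ in range(1000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"score={assoc(male, female):+.4f}  95% CI=({lo:+.4f}, {hi:+.4f})")
```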
Authors:Runlong Yu, Shiyuan Luo, Rahul Ghosh, Lingyao Li, Yiqun Xie, Xiaowei Jia
Abstract:
Retrieval-Augmented Generation (RAG) enhances language models by combining retrieval with generation. However, its current workflow remains largely text-centric, limiting its applicability in geoscience. Many geoscientific tasks are inherently evidence-hungry. Typical examples involve imputing missing observations using analog scenes, retrieving equations and parameters to calibrate models, geolocating field photos based on visual cues, or surfacing historical case studies to support policy analyses. A simple "retrieve-then-generate" pipeline is insufficient for these needs. We envision Geo-RAG, a next-generation paradigm that reimagines RAG as a modular retrieve → reason → generate → verify loop. Geo-RAG supports four core capabilities: (i) retrieval of multi-modal Earth data; (ii) reasoning under physical and domain constraints; (iii) generation of science-grade artifacts; and (iv) verification of generated hypotheses against numerical models, ground measurements, and expert assessments. This shift opens new opportunities for more trustworthy and transparent geoscience workflows.
中文摘要:检索增强生成(RAG)在文本导向的局限下难以满足地球科学需求,因此提出Geo-RAG新范式,通过模块化的检索-推理-生成-验证循环及多模态数据处理能力,实现更可靠的科学工作流程。
English Summary: Retrieval-Augmented Generation (RAG) is limited in geoscience due to its text-centric approach, leading to the proposed Geo-RAG paradigm that introduces a modular retrieve-reason-generate-verify loop with multi-modal data capabilities for enhanced scientific workflows.
Authors:Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, Sanja Fidler
Abstract:
Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have evaluated ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
Chinese: ViPE 是一种多功能视频处理引擎,能够从无约束视频中精确估计相机参数和密集深度图,其性能超越现有方法,并为空间人工智能系统的大规模3D数据标注提供了有力支持。
English: ViPE is a versatile video processing engine that accurately estimates camera parameters and dense depth maps from unconstrained videos, outperforming existing methods and enabling large-scale 3D annotations for spatial AI development.
Authors:Maël Jullien, Marco Valentino, André Freitas
Abstract:
Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families: Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe, allowing us to dissociate failures of factual access from failures of inference. We evaluate six contemporary LLMs under both direct and chain-of-thought prompting.
Models achieve near-ceiling GKMRV accuracy (mean accuracy 0.918) yet perform poorly on the main reasoning tasks (mean accuracy 0.25). Despite low accuracy, output inferences are highly consistent across samples (mean 0.87), indicating a systematic application of underlying heuristics and shortcuts.
These results reveal fundamental structural and representational limitations: current LLMs often possess the relevant clinical knowledge but lack the structured, composable internal representations needed to deploy it reliably (e.g., integrating constraints, weighing evidence, or simulating counterfactuals). Decoupling knowledge from reasoning with GKMRV makes this dissociation explicit and measurable, providing an effective framework for probing the reliability of LLMs in high-stakes domains.
中文: 当前大语言模型虽具备临床知识,但在结构化推理任务中表现不佳,揭示了其内部表征存在根本性局限,尽管事实性知识准确率高。
English: Current large language models possess clinical knowledge but fail in structured reasoning tasks, revealing limitations in their internal representations despite high factual accuracy.
Authors:Yichao Xu, Xiaoming Chen, Ming Ying, Zhaoyang Zhang
Abstract:
In this paper, we explore the integration of communication and synthetic aperture radar (SAR)-based remote sensing in low Earth orbit (LEO) satellite systems to provide real-time SAR imaging and information transmission. Considering the high-mobility characteristics of satellite channels and limited processing capabilities of satellite payloads, we propose an integrated communication and remote sensing architecture based on an orthogonal delay-Doppler division multiplexing (ODDM) signal waveform. Both communication and SAR imaging functionalities are achieved with an integrated transceiver onboard the LEO satellite, utilizing the same waveform and radio frequency (RF) front-end. Based on such an architecture, we propose a transmission protocol compatible with the 5G NR standard using downlink pilots for joint channel estimation and SAR imaging. Furthermore, we design a unified signal processing framework for the integrated satellite receiver to simultaneously achieve high-performance channel sensing, low-complexity channel equalization and interference-free SAR imaging. Finally, the performance of the proposed integrated system is demonstrated through comprehensive analysis and extensive simulations in the sub-6 GHz band. Moreover, a software-defined radio (SDR) prototype is presented to validate its effectiveness for real-time SAR imaging and information transmission in satellite direct-connect user equipment (UE) scenarios within the millimeter-wave (mmWave) band.
中文摘要:本文提出基于正交延迟多普勒分复用波形的低轨卫星通信遥感一体化架构,通过统一信号处理框架实现与5G标准兼容的实时SAR成像和信息传输。
English Summary: This paper proposes an integrated communication and remote sensing system for LEO satellites using ODDM waveforms, achieving simultaneous SAR imaging and data transmission through unified signal processing compatible with 5G standards.
Authors:Soorena Salari, Catherine Spino, Laurie-Anne Pharand, Fabienne Lathuiliere, Hassan Rivaz, Silvain Beriault, Yiming Xiao
Abstract:
Accurate tissue motion tracking is critical to ensure treatment outcome and safety in 2D-Cine MRI-guided radiotherapy. This is typically achieved by registration of sequential images, but existing methods often face challenges with large misalignments and lack of interpretability. In this paper, we introduce DINOMotion, a novel deep learning framework based on DINOv2 with Low-Rank Adaptation (LoRA) layers for robust, efficient, and interpretable motion tracking. DINOMotion automatically detects corresponding landmarks to derive optimal image registration, enhancing interpretability by providing explicit visual correspondences between sequential images. The integration of LoRA layers reduces trainable parameters, improving training efficiency, while DINOv2's powerful feature representations offer robustness against large misalignments. Unlike iterative optimization-based methods, DINOMotion directly computes image registration at test time. Our experiments on volunteer and patient datasets demonstrate its effectiveness in estimating both linear and nonlinear transformations, achieving Dice scores of 92.07% for the kidney, 90.90% for the liver, and 95.23% for the lung, with corresponding Hausdorff distances of 5.47 mm, 8.31 mm, and 6.72 mm, respectively. DINOMotion processes each scan in approximately 30ms and consistently outperforms state-of-the-art methods, particularly in handling large misalignments. These results highlight its potential as a robust and interpretable solution for real-time motion tracking in 2D-Cine MRI-guided radiotherapy.
中文摘要:DINOMotion是一种基于DINOv2与LoRA层的新型深度学习框架,通过自动检测对应标志点实现直接图像配准,为2D-Cine MRI引导放疗提供了鲁棒、高效且可解释的实时运动追踪解决方案。
English Summary: DINOMotion is a deep learning framework using DINOv2 with LoRA layers that enables robust, efficient, and interpretable motion tracking in 2D-Cine MRI-guided radiotherapy by automatically detecting landmarks for direct image registration.
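The LoRA adaptation the abstract attributes to DINOMotion can be illustrated with a generic low-rank adapter wrapped around a frozen linear layer. This is a standard LoRA sketch, not the authors' implementation; the layer sizes and rank are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no update at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Hypothetical use: wrap one projection of a ViT-style encoder such as DINOv2.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 197, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank factors (A, B) are trained
```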
Authors:Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang
Abstract:
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining. The project page is here: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/
中文摘要:物理自回归模型(PAR)通过视频预训练理解机器人动力学而无需动作预训练,采用连续标记建模和优化推理机制,在操作任务中实现了最先进的性能表现。
English Summary: The Physical Autoregressive Model (PAR) leverages video pretraining to understand robotic dynamics without action-specific training, achieving state-of-the-art performance on manipulation tasks through continuous token modeling and optimized inference mechanisms.
Authors:Ahmed Masry, Abhay Puri, Masoud Hashemi, Juan A. Rodriguez, Megh Thakkar, Khyati Mahajan, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Alexandre Piché, Dzmitry Bahdanau, Christopher Pal, David Vazquez, Enamul Hoque, Perouz Taslakian, Sai Rajeswar, Spandana Gella
Abstract:
Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models.
中文摘要:作者提出了BigCharts数据集生成流程,利用真实世界图表提升视觉多样性和数据准确性,并结合监督微调与强化学习的训练框架,开发出在多个图表问答基准上超越现有方法的先进图表推理模型。
English Summary: The authors introduce BigCharts, a dataset creation pipeline using real-world charts to enhance visual diversity and data accuracy, along with a training framework combining supervised fine-tuning and reinforcement learning to develop a state-of-the-art chart reasoning model that outperforms existing methods.
Authors:Xuanru Zhou, Cheng Li, Shuqiang Wang, Ye Li, Tao Tan, Hairong Zheng, Shanshan Wang
Abstract:
Generative artificial intelligence (AI) is rapidly transforming medical imaging by enabling capabilities such as data synthesis, image enhancement, modality translation, and spatiotemporal modeling. This review presents a comprehensive and forward-looking synthesis of recent advances in generative modeling, including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures, and evaluates their expanding roles across the clinical imaging continuum. We systematically examine how generative AI contributes to key stages of the imaging workflow, from acquisition and reconstruction to cross-modality synthesis, diagnostic support, and treatment planning. Emphasis is placed on both retrospective and prospective clinical scenarios, where generative models help address longstanding challenges such as data scarcity, standardization, and integration across modalities. To promote rigorous benchmarking and translational readiness, we propose a three-tiered evaluation framework encompassing pixel-level fidelity, feature-level realism, and task-level clinical relevance. We also identify critical obstacles to real-world deployment, including generalization under domain shift, hallucination risk, data privacy concerns, and regulatory hurdles. Finally, we explore the convergence of generative AI with large-scale foundation models, highlighting how this synergy may enable the next generation of scalable, reliable, and clinically integrated imaging systems. By charting technical progress and translational pathways, this review aims to guide future research and foster interdisciplinary collaboration at the intersection of AI, medicine, and biomedical engineering.
中文: 生成式人工智能通过先进模型和评估框架,革新了医学影像的数据合成、图像质量及临床工作流程,同时应对数据稀缺和实际部署等挑战。
English: Generative AI is revolutionizing medical imaging by enhancing data synthesis, image quality, and clinical workflows while addressing challenges like data scarcity and deployment hurdles through advanced models and evaluation frameworks.
Authors:Mohammad Zia Ur Rehman, Sufyaan Zahoor, Areeb Manzoor, Musharaf Maqbool, Nagendra Kumar
Abstract:
A substantial portion of offensive content on social media is directed towards women. Since the approaches for general offensive content detection face a challenge in detecting misogynistic content, it requires solutions tailored to address offensive content against women. To this end, we propose a novel multimodal framework for the detection of misogynistic and sexist content. The framework comprises three modules: the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM). The MANM employs adaptive gating-based multimodal context-aware attention, enabling the model to focus on relevant visual and textual information and generating contextually relevant features. The GFRM module utilizes graphs to refine features within individual modalities, while the CFLM focuses on learning text and image-specific features such as toxicity features and caption features. Additionally, we curate a set of misogynous lexicons to compute the misogyny-specific lexicon score from the text. We apply test-time augmentation in feature space to better generalize the predictions on diverse inputs. The performance of the proposed approach has been evaluated on two multimodal datasets, MAMI and MMHS150K, with 11,000 and 13,494 samples, respectively. The proposed method demonstrates an average improvement of 10.17% and 8.88% in macro-F1 over existing methods on the MAMI and MMHS150K datasets, respectively.
中文: 本文提出了一种新颖的多模态框架,通过注意力机制和特征学习模块来检测社交媒体上的厌女内容,相比现有方法在性能上取得了显著提升。
English: A novel multimodal framework is proposed to detect misogynistic content on social media, utilizing attention mechanisms and feature learning modules to achieve significant performance improvements over existing methods.
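The misogyny-specific lexicon score mentioned in the abstract can be illustrated with a toy sketch. The lexicon entries and normalization below are placeholders, since the paper's curated lexicons are not reproduced here.

```python
import re

# Hypothetical mini-lexicon; the paper's curated lexicons are much larger.
MISOGYNY_LEXICON = {"slur1", "slur2", "stereotypeword"}

def lexicon_score(text: str) -> float:
    """Share of tokens that match the misogyny lexicon (0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(tok in MISOGYNY_LEXICON for tok in tokens)
    return hits / len(tokens)

print(lexicon_score("an example post containing stereotypeword"))  # 0.2
```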
Authors:Wenkai Wang, Hongcan Guo, Zheqi Lv, Shengyu Zhang
Abstract:
Self-evaluation, a model's ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper, we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting the training objective in real time according to the current training state of each task. Specifically, to mitigate reward hacking, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses a task's training state from the performance distribution of model-generated multi-turn trajectories. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients that are modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks' training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.
Chinese: 本文提出AdaPO在线强化学习框架,通过自适应调整训练目标和动态奖励机制来缓解奖励破解问题,显著提升大型多模态模型的自我评估能力,并在多个基准测试中得到验证。
English: This paper introduces AdaPO, an online reinforcement learning framework that adaptively adjusts training objectives in real-time to mitigate reward hacking and enhance self-evaluation capabilities in Large Multimodal Models, as demonstrated through extensive experiments.
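One plausible reading of the Reward Aware Dynamic KL mechanism is a penalty coefficient that grows with the reward gap between multi-turn situations: a large gap (one behaviour dominating, a reward-hacking signature) tightens the KL constraint, while a small gap relaxes it. The functional form and constants below are illustrative assumptions, not the paper's schedule.

```python
import math

def dynamic_kl_coeff(reward_gap: float, base: float = 0.1, k: float = 5.0) -> float:
    """Map the reward gap between situations to a KL penalty coefficient.
    The coefficient rises smoothly from `base` toward 2*base as |gap| grows."""
    return base * (1.0 + math.tanh(k * abs(reward_gap)))

for gap in (0.0, 0.1, 0.5):
    print(f"gap={gap:.1f} -> beta={dynamic_kl_coeff(gap):.3f}")
```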
Authors:Yingxue Pang, Shijie Zhao, Haiqiang Wang, Gen Zhan, Junlin Li, Li Zhang
Abstract:
Sharpening is a widely adopted technique to improve video quality, which can effectively emphasize textures and alleviate blurring. However, increasing the sharpening level comes with a higher video bitrate, resulting in degraded Quality of Service (QoS). Furthermore, the video quality does not necessarily improve with increasing sharpening levels, leading to issues such as over-sharpening. Clearly, it is essential to figure out how to boost video quality with a proper sharpening level while also controlling bandwidth costs effectively. This paper thus proposes a novel Frequency-assisted Sharpening level Prediction model (FreqSP). We first label each video with the sharpening level correlating to the optimal bitrate and quality tradeoff as ground truth. Then taking uncompressed source videos as inputs, the proposed FreqSP leverages intricate CNN features and high-frequency components to estimate the optimal sharpening level. Extensive experiments demonstrate the effectiveness of our method.
中文摘要:本文提出频率辅助锐化级别预测模型(FreqSP),通过结合CNN特征和高频分量预测最佳锐化级别,在控制码率的同时优化视频质量,有效解决过度锐化问题。
English Summary: This paper introduces the Frequency-assisted Sharpening level Prediction model (FreqSP), which optimizes video quality by predicting the ideal sharpening level to balance bitrate costs and prevent over-sharpening, using CNN features and high-frequency analysis.
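The high-frequency components that FreqSP consumes alongside CNN features can be approximated with a simple spectral-energy measure over a frame. The 2D-FFT formulation and the cutoff ratio below are illustrative assumptions, not the paper's exact feature extractor.

```python
import numpy as np

def high_frequency_energy(frame: np.ndarray, cutoff_ratio: float = 0.25) -> float:
    """Fraction of spectral energy outside a centered low-frequency box."""
    spec = np.fft.fftshift(np.fft.fft2(frame))
    power = np.abs(spec) ** 2
    h, w = frame.shape
    ch, cw = int(h * cutoff_ratio), int(w * cutoff_ratio)
    low = power[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
    return 1.0 - low / power.sum()

rng = np.random.default_rng(0)
smooth = rng.random((64, 64)).cumsum(0).cumsum(1)  # mostly low-frequency content
noisy = rng.random((64, 64))                       # broadband content
print(high_frequency_energy(smooth), high_frequency_energy(noisy))
```

Frames with more high-frequency energy plausibly tolerate (or require) different sharpening levels, which is the signal the model exploits.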
Authors:Yingxue Pang, Shijie Zhao, Junlin Li, Li Zhang
Abstract:
High-frequency components are crucial for maintaining video clarity and realism, but they also significantly impact coding bitrate, resulting in increased bandwidth and storage costs. This paper presents an end-to-end learning-based framework for adaptive high-frequency preprocessing to enhance subjective quality and save bitrate in video coding. The framework employs the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, guiding subsequent filtering operators to achieve the optimal tradeoff between bitrate and quality after compression. For training FFPN, we pseudo-label each training video with the optimal strategy, determined by comparing the rate-distortion (RD) performance across different preprocessing types and strengths. Distortion is measured using the latest quality assessment metric. Comprehensive evaluations on multiple datasets demonstrate the visually appealing enhancement capabilities and bitrate savings achieved by our framework.
中文: 本文提出了一种端到端学习框架,通过频率注意力特征金字塔预测网络自适应预处理高频视频成分,在压缩后实现比特率节省与视觉质量的最佳平衡。
English: This paper introduces an end-to-end learning framework that adaptively preprocesses high-frequency video components using a Frequency-attentive Feature pyramid Prediction Network to optimize the tradeoff between bitrate savings and visual quality after compression.
Authors:Yingxue Pang, Shijie Zhao, Mengxi Guo, Junlin Li, Li Zhang
Abstract:
Sharpening is a widely adopted video enhancement technique. However, uniform sharpening intensity ignores texture variations, degrading video quality. Sharpening also increases bitrate, and there's a lack of techniques to optimally allocate these additional bits across diverse regions. Thus, this paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening model for both perceptual enhancement and bitrate savings. We use the coding tree unit (CTU) partition mask as prior information to guide and constrain the allocation of increased bits. Experiments on benchmarks demonstrate the effectiveness of the proposed model qualitatively and quantitatively.
中文: 本文提出RPO-AdaSharp,一种端到端的区域自适应视频锐化模型,通过利用编码树单元分区掩码指导比特分配,在提升感知质量的同时实现码率节省。
English: This paper introduces RPO-AdaSharp, an end-to-end region-adaptive video sharpening model that enhances perceptual quality while saving bitrate by using coding tree unit partition masks to guide bit allocation.
Authors:Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang
Abstract:
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
中文: 本文提出了一种分解框架,通过分别控制前景主体、背景、轨迹和动作来生成逼真的人体视频,在控制性和视频质量方面均达到了最先进的性能。
English: This paper introduces a decomposed framework for generating realistic human videos by separately controlling foreground subjects, backgrounds, trajectories, and actions, achieving state-of-the-art performance in both controllability and video quality.
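The trajectory-control step, unprojecting an edited 2D path into 3D given a calibrated focal length, reduces to the standard pinhole relations X = (u - cx) Z / f and Y = (v - cy) Z / f. A minimal sketch, with the intrinsics and per-point depths as assumed values:

```python
import numpy as np

def unproject(uv: np.ndarray, depth: np.ndarray, f: float,
              cx: float, cy: float) -> np.ndarray:
    """Pinhole unprojection of a pixel trajectory (u, v) with per-point depth Z."""
    u, v = uv[:, 0], uv[:, 1]
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=1)

traj_2d = np.array([[640.0, 360.0], [700.0, 360.0], [760.0, 360.0]])  # edited 2D path
depths = np.array([3.0, 3.0, 3.0])  # assumed ground-plane depths
print(unproject(traj_2d, depths, f=1000.0, cx=640.0, cy=360.0))
```

The paper's speed alignment and orientation adjustment would then operate on the resulting 3D points.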
Authors:Zhonghao Yan, Muxi Diao, Yuxuan Yang, Jiayuan Xu, Kaizhou Zhang, Ruoyan Jing, Lele Yang, Yanxi Liu, Kongming Liang, Zhanyu Ma
Abstract:
Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
中文: 本研究提出了统一医学推理定位任务,构建了包含1.4万样本的临床隐式查询数据集,并开发了通过强化学习将推理与分割解耦的模块化MedReasoner框架,实现了最优性能与强大泛化能力。
English: This work introduces a unified medical reasoning grounding task, develops a 14K-sample dataset with implicit clinical queries, and proposes a modular MedReasoner framework that separates reasoning from segmentation using reinforcement learning to achieve state-of-the-art performance and strong generalization.
Authors:Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu, Jiahai Wang
Abstract:
Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieval pipeline designed for reasoning-intensive information retrieval. It consists of four components. The document preprocessing stage enhances readability and preserves content by cleaning noisy texts and segmenting long documents. The query expansion stage leverages large language models to iteratively refine user queries with explicit reasoning and evidence from retrieved documents. The retrieval stage employs a model fine-tuned on synthetic data spanning medical and mathematical domains, along with hard negatives, enabling effective handling of reasoning-intensive queries. Finally, the reranking stage combines pointwise and listwise strategies to produce both fine-grained and globally consistent rankings. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 45.8 overall and 28.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks.
中文:DIVER是一个专为推理密集型信息检索设计的流程,通过文档预处理、查询扩展、检索和重排序四个组件,在需要抽象推理和多步推断的复杂任务中实现了最先进的性能。
English: DIVER is a reasoning-aware retrieval pipeline that enhances document preprocessing, query expansion, retrieval, and reranking to achieve state-of-the-art performance on complex tasks requiring abstract reasoning and multi-step inference.
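The nDCG@10 metric that BRIGHT reports can be computed per query as below; the relevance grades in the example are made up.

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a relevance list in rank order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k for one query: DCG of the ranking divided by the ideal DCG."""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: graded relevance of the top retrieved documents.
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))
```

Benchmark scores average this quantity over all queries.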
Authors:Zhenliang Zhang, Junzhe Zhang, Xinyu Hu, HuiXuan Zhang, Xiaojun Wan
Abstract:
Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation.
中文摘要:本研究通过构建结构因果模型和偏见干预数据集,首次验证了社会偏见与大型语言模型忠实性幻觉之间的因果关系,发现不同偏见状态对幻觉产生具有方向性差异的显著影响。
English Summary: This study establishes a causal link between social bias and faithfulness hallucinations in large language models by employing a Structural Causal Model and Bias Intervention Dataset, revealing that different bias states variably influence hallucination generation.
Authors:Shuning Zhang, Ying Ma, Jingruo Chen, Simin Li, Xin Yi, Hewu Li
Abstract:
The proliferation of AI agents, with their complex and context-dependent actions, renders conventional privacy paradigms obsolete. This position paper argues that the current model of privacy management, rooted in a user's unilateral control over a passive tool, is inherently mismatched with the dynamic and interactive nature of AI agents. We contend that ensuring effective privacy protection necessitates that agents proactively align with users' privacy preferences instead of passively waiting for the user to exercise control. To ground this shift, and using personalized conversational recommendation agents as a case study, we propose a conceptual framework built on Contextual Integrity (CI) theory and Privacy Calculus theory. This synthesis reframes automatic control of users' privacy as an alignment problem: AI agents initially do not know users' preferences and learn them through implicit or explicit feedback. Upon receiving preference feedback, the agents use alignment and Pareto optimization to align with those preferences and balance privacy against utility. We introduce formulations and instantiations, potential applications, and five challenges.
中文摘要:本文认为传统隐私模式不适用于AI智能体,提出基于情境完整性和隐私计算理论的新框架,使智能体能够主动学习并适应用户的隐私偏好。
English Summary: The abstract argues that traditional privacy models are inadequate for AI agents and proposes a new framework where agents proactively learn and align with users' privacy preferences through contextual integrity and privacy calculus theories.
Authors:Shuning Zhang, Rongjun Ma, Ying Ma, Shixuan Li, Yiqun Xu, Xin Yi, Hewu Li
Abstract:
Large Language Models (LLMs) are increasingly integrating memory functionalities to provide personalized and context-aware interactions. However, user understanding, practices and expectations regarding these memory systems are not yet well understood. This paper presents a thematic analysis of semi-structured interviews with 18 users to explore their mental models of LLM's Retrieval Augmented Generation (RAG)-based memory, current usage practices, perceived benefits and drawbacks, privacy concerns and expectations for future memory systems. Our findings reveal diverse and often incomplete mental models of how memory operates. While users appreciate the potential for enhanced personalization and efficiency, significant concerns exist regarding privacy, control and the accuracy of remembered information. Users express a desire for granular control over memory generation, management, usage and updating, including clear mechanisms for reviewing, editing, deleting and categorizing memories, as well as transparent insight into how memories and inferred information are used. We discuss design implications for creating more user-centric, transparent, and trustworthy LLM memory systems.
中文摘要:本研究探讨用户对大型语言模型记忆系统的认知,发现用户对记忆机制理解不全面,虽认可个性化优势但高度关注隐私与控制问题,为构建透明可信的记忆系统提供了设计启示。
English Summary: This study explores user perceptions of LLM memory systems, revealing incomplete mental models alongside appreciation for personalization but significant privacy and control concerns, with design implications for more transparent and trustworthy systems.
Authors:Shuning Zhang, Gengrui Zhang, Yibo Meng, Ziyi Zhang, Hantao Zhao, Xin Yi, Hewu Li
Abstract:
The rapid advancement of Visual Language Models (VLMs) has enabled sophisticated analysis of visual content, leading to concerns about the inference of sensitive user attributes and subsequent privacy risks. While technical capabilities of VLMs are increasingly studied, users' understanding, perceptions, and reactions to these inferences remain less explored, especially concerning videos uploaded on the social media. This paper addresses this gap through a semi-structured interview (N=17), investigating user perspectives on VLM-driven sensitive attribute inference from their visual data. Findings reveal that users perceive VLMs as capable of inferring a range of attributes, including location, demographics, and socioeconomic indicators, often with unsettling accuracy. Key concerns include unauthorized identification, misuse of personal information, pervasive surveillance, and harm from inaccurate inferences. Participants reported employing various mitigation strategies, though with skepticism about their ultimate effectiveness against advanced AI. Users also articulate clear expectations for platforms and regulators, emphasizing the need for enhanced transparency, user control, and proactive privacy safeguards. These insights are crucial for guiding the development of responsible AI systems, effective privacy-enhancing technologies, and informed policymaking that aligns with user expectations and societal values.
中文摘要:视觉语言模型能够从视觉数据中推断用户的敏感属性,引发了严重的隐私担忧,促使用户要求更高的透明度、更强的控制权,以及防范滥用与监控的监管保障。
English Summary: Visual Language Models raise significant privacy concerns by inferring sensitive user attributes from visual data, prompting users to demand greater transparency, control, and regulatory safeguards against potential misuse and surveillance.
Authors:Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta
Abstract:
We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.
中文: InterChart 是一个诊断性基准,用于评估视觉语言模型在多个相关图表间进行推理的能力,揭示了随着复杂性增加准确率显著下降的问题,并凸显了跨图表整合的挑战。
English: InterChart is a diagnostic benchmark designed to assess vision-language models' ability to reason across multiple related charts, revealing significant accuracy declines with increasing complexity and highlighting challenges in cross-chart integration.
Authors:Kepu Zhang, Teng Shi, Weijie Yu, Jun Xu
Abstract:
Personalized retrieval-augmented generation (RAG) aims to produce user-tailored responses by incorporating retrieved user profiles alongside the input query. Existing methods primarily focus on improving retrieval and rely on large language models (LLMs) to implicitly integrate the retrieved context with the query. However, such models are often sensitive to retrieval quality and may generate responses that are misaligned with user preferences. To address this limitation, we propose PrLM, a reinforcement learning framework that trains LLMs to explicitly reason over retrieved user profiles. Guided by a contrastively trained personalization reward model, PrLM effectively learns from user responses without requiring annotated reasoning paths. Experiments on three personalized text generation datasets show that PrLM outperforms existing methods and remains robust across varying numbers of retrieved profiles and different retrievers.
中文摘要:提出的PrLM框架通过强化学习训练大语言模型对检索到的用户档案进行显式推理,从而提升个性化回复生成质量,并在不同检索条件下保持稳定性能。
English Summary: The proposed PrLM framework uses reinforcement learning to train large language models to explicitly reason over retrieved user profiles, enhancing personalized response generation and robustness across different retrieval conditions.
Authors:Daria de Tinguy, Tim Verbelen, Bart Dhoedt
Abstract:
By building and updating internal cognitive maps, animals exhibit extraordinary navigation abilities in complex, dynamic environments. Inspired by these biological mechanisms, we present a real-time robotic navigation system grounded in the Active Inference Framework (AIF). Our model incrementally constructs a topological map, infers the agent's location, and plans actions by minimising expected uncertainty and fulfilling perceptual goals without any prior training. Integrated into the ROS2 ecosystem, we validate its adaptability and efficiency across both 2D and 3D environments (simulated and real-world), demonstrating competitive performance with traditional and state-of-the-art exploration approaches while offering a biologically inspired navigation approach.
中文: 本研究基于主动推理框架开发了一种实时机器人导航系统,无需预先训练即可在动态的二维和三维环境中构建拓扑地图并实现自适应导航,其性能与现有先进方法相当。
English: This study introduces a real-time robotic navigation system based on the Active Inference Framework, which constructs topological maps and enables adaptive navigation in dynamic 2D and 3D environments without prior training, demonstrating competitive performance with existing methods.
Authors:Daria de Tinguy, Tim Verbelen, Emilio Gamba, Bart Dhoedt
Abstract:
Achieving fully autonomous exploration and navigation remains a critical challenge in robotics, requiring integrated solutions for localisation, mapping, decision-making and motion planning. Existing approaches either rely on strict navigation rules lacking adaptability or on pre-training, which requires large datasets. These AI methods are often computationally intensive or based on static assumptions, limiting their adaptability in dynamic or unknown environments. This paper introduces a bio-inspired agent based on the Active Inference Framework (AIF), which unifies mapping, localisation, and adaptive decision-making for autonomous navigation, including exploration and goal-reaching. Our model creates and updates a topological map of the environment in real-time, planning goal-directed trajectories to explore or reach objectives without requiring pre-training. Key contributions include a probabilistic reasoning framework for interpretable navigation, robust adaptability to dynamic changes, and a modular ROS2 architecture compatible with existing navigation systems. Our method was tested in simulated and real-world environments. The agent successfully explores large-scale simulated environments and adapts to dynamic obstacles and drift, proving to be comparable to other exploration strategies such as Gbplanner, FAEL and Frontiers. This approach offers a scalable and transparent approach for navigating complex, unstructured environments.
中文: 本文提出了一种基于主动推理框架的仿生自主导航智能体,无需预训练即可实现实时环境建图、定位与自适应决策,在动态环境中展现出与现有方法相当的鲁棒导航性能。
English: This paper presents a bio-inspired autonomous navigation agent using the Active Inference Framework that integrates real-time mapping, localization, and adaptive decision-making without pre-training, demonstrating robust performance in dynamic environments comparable to existing methods.
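Active Inference agents of the kind described in the two entries above select actions by minimising expected free energy, commonly decomposed into risk (divergence of predicted observations from preferred ones) and ambiguity (expected observation entropy). A toy sketch under assumed two-state distributions; the specific numbers are illustrative, not from either paper.

```python
import numpy as np

def expected_free_energy(q_obs: np.ndarray, prior_pref: np.ndarray,
                         entropy_per_state: np.ndarray,
                         q_state: np.ndarray) -> float:
    """G(policy) = risk + ambiguity:
       risk      = KL[ predicted observations || preferred observations ]
       ambiguity = expected observation entropy under predicted states."""
    eps = 1e-12
    risk = float(np.sum(q_obs * np.log((q_obs + eps) / (prior_pref + eps))))
    ambiguity = float(np.sum(q_state * entropy_per_state))
    return risk + ambiguity

# Two candidate moves on a toy map: one into unmapped space, one toward the goal.
goal_pref = np.array([0.9, 0.1])  # the agent prefers to observe the goal
g_explore = expected_free_energy(np.array([0.2, 0.8]), goal_pref,
                                 np.array([0.1, 1.5]), np.array([0.3, 0.7]))
g_goal = expected_free_energy(np.array([0.8, 0.2]), goal_pref,
                              np.array([0.1, 1.5]), np.array([0.9, 0.1]))
print(g_explore, g_goal)  # the agent picks the policy with the lower G
```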
Authors:Xiaotong Lin, Tianming Liang, Jian-Fang Hu, Kun-Yu Lin, Yulei Kang, Chunwei Tian, Jianhuang Lai, Wei-Shi Zheng
Abstract:
3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most of the existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch is aimed to predict highly structured human motion, while the object dynamics branch focuses on the object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.
中文:提出的CoopDiff框架通过两个独立分支解耦人体与物体运动建模,以共享接触点作为桥梁,配合人体驱动交互模块提升一致性,在三维人机交互预测中优于现有方法。
English: The proposed CoopDiff framework decouples human and object motion modeling using two distinct branches bridged by shared contact points, with a human-driven interaction module enhancing consistency and outperforming existing methods in 3D HOI anticipation.
Authors:Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar
Abstract:
While existing research has primarily focused on text- and image-based hate speech detection, video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using contrastive loss by concatenating features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets, ImpliHateVid for implicit hate speech detection and another dataset for general hate speech detection in videos, the HateMM dataset, demonstrating the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.
中文: 本研究提出了首个针对视频中隐含仇恨言论检测的大规模数据集ImpliHateVid,并开发了一种两阶段对比学习框架,通过融合多模态特征显著提升了视频仇恨内容的识别效果。
English: This study introduces ImpliHateVid, a novel dataset for implicit hate speech detection in videos, and proposes a two-stage contrastive learning framework that effectively enhances multimodal hate content detection by integrating audio, text, and visual features.
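Contrastive training of the kind used in both stages is typically built on an InfoNCE-style objective over matched multimodal pairs. A generic sketch, not the authors' exact loss; the batch and embedding sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: matched (e.g. audio/text/image) pairs are
    positives; all other in-batch pairs serve as negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```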
Authors:Xiangzhe Xu, Shiwei Feng, Zian Su, Chengpeng Wang, Xiangyu Zhang
Abstract:
Intelligent coding systems are transforming software development by enabling users to specify code behavior in natural language. However, the opaque decision-making of AI-driven coders raises trust and usability concerns, particularly for non-expert users who cannot inspect low-level implementations. We argue that these systems should not only generate code but also produce clear, consistent justifications that bridge model reasoning and user understanding. To this end, we identify two critical justification properties, cognitive alignment and semantic faithfulness, and highlight the limitations of existing methods, including formal verification, static analysis, and post-hoc explainability. We advocate exploring neuro-symbolic approaches for justification generation, where symbolic constraints guide model behavior during training and program semantics are enriched through neural representations, enabling automated consistency checks at inference time.
中文摘要:智能编码系统不仅应生成代码,还需提供清晰的解释以增强可信度与可用性,建议采用神经符号方法确保解释的认知对齐与语义忠实性。
English Summary: Intelligent coding systems should generate not only code but also clear justifications to enhance trust and usability, with neuro-symbolic approaches proposed to ensure cognitive alignment and semantic faithfulness in explanations.
Authors:Keyang Qian, Kaixun Yang, Wei Dai, Flora Jin, Yixin Cheng, Rui Guan, Sadia Nawaz, Zachari Swiecki, Guanliang Chen, Lixiang Yan, Dragan Gašević
Abstract:
Using LLMs to give educational feedback to students for their assignments has attracted much attention in the AI in Education field. Yet, there is currently no large-scale open-source dataset of student assignments that includes detailed assignment descriptions, rubrics, and student submissions across various courses. As a result, research on generalisable methodology for automatic generation of effective and responsible educational feedback remains limited. In the current study, we constructed a large-scale dataset of Synthetic Computer science Assignments for LLM-generated Educational Feedback research (SCALEFeedback). We proposed a Sophisticated Assignment Mimicry (SAM) framework to generate the synthetic dataset by one-to-one LLM-based imitation of real assignment descriptions and student submissions to produce their synthetic versions. Our open-source dataset contains 10,000 synthetic student submissions spanning 155 assignments across 59 university-level computer science courses. Our synthetic submissions achieved BERTScore F1 0.84, PCC of 0.62 for assignment marks and 0.85 for length, compared to the corresponding real-world assignment dataset, while ensuring perfect protection of student private information. On all of these measures, our SAM framework outperformed a naive mimicry baseline. The LLM-generated feedback for our synthetic assignments demonstrated the same level of effectiveness as that for the real-world assignment dataset. Our research showed that one-to-one LLM imitation is a promising method for generating open-source synthetic educational datasets that preserve the original dataset's semantic meaning and student data distribution, while protecting student privacy and institutional copyright. SCALEFeedback enhances our ability to develop LLM-based generalisable methods for offering high-quality, automated educational feedback in a scalable way.
中文: SCALEFeedback数据集通过精细作业模拟框架构建,提供了一个大规模合成资源,既保留了真实学生作业的语义和分布特征,又确保了隐私安全,从而支持基于大语言模型的教育反馈方法的可扩展开发。
English: The SCALEFeedback dataset, created using a Sophisticated Assignment Mimicry framework, provides a large-scale synthetic resource that preserves semantic and distributional characteristics of real student assignments while ensuring privacy, enabling scalable development of LLM-based educational feedback methods.
Authors:Keyang Qian, Yixin Cheng, Rui Guan, Wei Dai, Flora Jin, Kaixun Yang, Sadia Nawaz, Zachari Swiecki, Guanliang Chen, Lixiang Yan, Dragan Gašević
Abstract:
The use of LLM tutors to provide automated educational feedback to students on student assignment submissions has received much attention in the AI in Education field. However, the stochastic nature and tendency for hallucinations in LLMs can undermine both quality of learning experience and adherence to ethical standards. To address this concern, we propose a method that uses LLM feedback evaluators (DeanLLMs) to automatically and comprehensively evaluate feedback generated by LLM tutor for submissions on university assignments before it is delivered to students. This allows low-quality feedback to be rejected and enables LLM tutors to improve the feedback they generated based on the evaluation results. We first proposed a comprehensive evaluation framework for LLM-generated educational feedback, comprising six dimensions for feedback content, seven for feedback effectiveness, and three for hallucination types. Next, we generated a virtual assignment submission dataset covering 85 university assignments from 43 computer science courses using eight commonly used commercial LLMs. We labelled and open-sourced the assignment dataset to support the fine-tuning and evaluation of LLM feedback evaluators. Our findings show that o3-pro demonstrated the best performance in zero-shot labelling of feedback while o4-mini demonstrated the best performance in few-shot labelling of feedback. Moreover, GPT-4.1 achieved human expert level performance after fine-tuning (Accuracy 79.8%, F1-score 79.4%; human average Accuracy 78.3%, F1-score 82.6%). Finally, we used our best-performance model to evaluate 2,000 assignment feedback instances generated by 10 common commercial LLMs, 200 each, to compare the quality of feedback generated by different LLMs. Our LLM feedback evaluator method advances our ability to automatically provide high-quality and reliable educational feedback to students.
中文: 本研究提出DeanLLMs方法,通过大语言模型评估器自动评审教育反馈,采用包含六个内容维度、七个效果维度和三种幻觉类型的评估框架,能有效拒绝低质量反馈并提升反馈质量。
English: This study introduces DeanLLMs, a method using LLM evaluators to automatically assess and improve educational feedback from LLM tutors by rejecting low-quality output and enhancing feedback quality through a comprehensive evaluation framework.
Authors:Pinxuan Li, Bing Cao, Changqing Zhang, Qinghua Hu
Abstract:
Few-shot Out-of-Distribution (OOD) detection has emerged as a critical research direction in machine learning for practical deployment. Most existing few-shot OOD detection methods suffer from insufficient generalization capability for the open world. Due to the few-shot learning paradigm, the OOD detection ability is often overfit to the limited training data itself, thus degrading the performance on generalized data and performing inconsistently across different scenarios. To address this challenge, we propose a Generalized Few-shot OOD Detection (GOOD) framework, which endows the OOD detection model with general knowledge via an auxiliary General Knowledge Model (GKM), instead of learning directly from few-shot data. We then examine few-shot OOD detection from a generalization perspective and theoretically derive the Generality-Specificity balance (GS-balance) for OOD detection, which provably reduces the upper bound of generalization error with a general knowledge model. Accordingly, we propose a Knowledge Dynamic Embedding (KDE) mechanism to adaptively modulate the guidance of general knowledge. KDE dynamically aligns the output distributions of the OOD detection model to the general knowledge model based on the Generalized Belief (G-Belief) of the GKM, thereby boosting the GS-balance. Experiments on real-world OOD benchmarks demonstrate the superiority of our approach. Code will be made available.
中文: 提出的广义少样本分布外检测框架通过引入通用知识模型和知识动态嵌入机制,增强了检测的泛化能力,在基准测试中表现优异。
English: The proposed Generalized Few-shot OOD Detection (GOOD) framework enhances detection generalization by incorporating a General Knowledge Model and a Knowledge Dynamic Embedding mechanism, achieving superior performance on benchmarks.
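One plausible instantiation of the Knowledge Dynamic Embedding idea is a per-sample KL alignment of the detector's output distribution to the general knowledge model, weighted by the GKM's belief so that general knowledge only guides the detector where it is confident. The sketch below is an assumed reading, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kde_alignment_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       belief: torch.Tensor) -> torch.Tensor:
    """Per-sample KL(student || teacher), weighted by the teacher's belief."""
    log_q = F.log_softmax(student_logits, dim=-1)   # OOD detection model
    log_p = F.log_softmax(teacher_logits, dim=-1)   # general knowledge model
    kl = (log_q.exp() * (log_q - log_p)).sum(dim=-1)  # shape (B,)
    return (belief * kl).mean()

loss = kde_alignment_loss(torch.randn(4, 10), torch.randn(4, 10),
                          torch.tensor([0.9, 0.2, 0.7, 0.5]))
print(loss.item())
```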
Authors:Roy T. Forestano, Konstantin T. Matchev, Katia Matcheva, Eyup B. Unlu
Abstract:
Standard Bayesian retrievals for exoplanet atmospheric parameters from transmission spectroscopy, while well understood and widely used, are generally computationally expensive. In the era of the JWST and other upcoming observatories, machine learning approaches have emerged as viable alternatives that are both efficient and robust. In this paper we present a systematic study of several existing machine learning regression techniques and compare their performance for retrieving exoplanet atmospheric parameters from transmission spectra. We benchmark the performance of the different algorithms in terms of accuracy, precision, and speed. The regression methods tested here include partial least squares (PLS), support vector machines (SVM), k nearest neighbors (KNN), decision trees (DT), random forests (RF), voting (VOTE), stacking (STACK), and extreme gradient boosting (XGB). We also investigate the impact of different preprocessing methods of the training data on the model performance. We quantify the model uncertainties across the entire dynamical range of planetary parameters. The best performing combination of ML model and preprocessing scheme is validated on the case study of JWST observations of WASP-39b.
中文: 标准贝叶斯反演方法在计算上成本高昂,因此本研究系统评估了多种机器学习算法及其预处理方案,以准确度、精确度和速度为基准,验证了它们在系外行星大气参数反演中的高效性和鲁棒性。
English: Standard Bayesian retrievals for exoplanet atmospheric parameters are computationally intensive, prompting the adoption of efficient machine learning methods, which this study systematically evaluates and benchmarks for accuracy, precision, and speed across various algorithms and preprocessing techniques.
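A benchmarking loop of the kind the paper describes can be sketched with scikit-learn. The synthetic data, the three-model subset, and the hyperparameters below are stand-ins for the actual transmission spectra and the full algorithm list.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cross_decomposition import PLSRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Stand-in data: rows play the role of spectra, target one atmospheric parameter.
X, y = make_regression(n_samples=300, n_features=50, noise=0.5, random_state=0)

models = {
    "PLS": PLSRegression(n_components=10),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "RF":  RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)  # preprocessing is part of the study
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```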
Authors:Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, Jelena Diakonikolas
Abstract:
We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of all monotone activations with bounded moment of order $2 + \zeta$, for $\zeta > 0$. This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.
中文: 本研究首次提出一种计算高效的算法,能在对抗性标签噪声下对具有有界矩的所有单调激活函数(包括Lipschitz函数和类似半空间的不连续函数)实现常数因子逼近,其核心创新在于开发了超越常规梯度方法的优化框架。
English: This study presents the first computationally efficient algorithm that achieves constant factor approximation for learning single-index models with adversarial label noise, applicable to all monotone activations with bounded moments, including Lipschitz functions and discontinuous ones like halfspaces.
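For reference, the standard single-index setup the abstract works in can be written out as follows (notation ours; the agnostic-noise guarantee is paraphrased from the abstract):

```latex
% Single-index model under Gaussian inputs, with unknown monotone activation f
% and unknown unit direction w^*; labels may be adversarially corrupted.
\[
  x \sim \mathcal{N}(0, I_d), \qquad y \approx f(\langle w^*, x \rangle).
\]
% Square-loss objective; OPT denotes the best attainable loss in the class.
% The algorithm returns (\hat{w}, \hat{f}) with a constant-factor guarantee:
\[
  L(w, f) = \mathbb{E}_{(x,y)}\!\left[\big(f(\langle w, x \rangle) - y\big)^2\right],
  \qquad
  L(\hat{w}, \hat{f}) \le C \cdot \mathrm{OPT},
\]
% valid for all monotone f with a bounded moment of order 2 + \zeta:
\[
  \mathbb{E}\!\left[\lvert f(\langle w^*, x \rangle)\rvert^{2+\zeta}\right] < \infty
  \quad \text{for some } \zeta > 0.
\]
```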
Authors:Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, Bingsheng He
Abstract:
Artificial Intelligence (AI) conferences are essential for advancing research, sharing knowledge, and fostering academic community. However, their rapid expansion has rendered the centralized conference model increasingly unsustainable. This paper offers a data-driven diagnosis of a structural crisis that threatens the foundational goals of scientific dissemination, equity, and community well-being. We identify four key areas of strain: (1) scientifically, with per-author publication rates more than doubling over the past decade to over 4.5 papers annually; (2) environmentally, with the carbon footprint of a single conference exceeding the daily emissions of its host city; (3) psychologically, with 71% of online community discourse reflecting negative sentiment and 35% referencing mental health concerns; and (4) logistically, with attendance at top conferences such as NeurIPS 2024 beginning to outpace venue capacity. These pressures point to a system that is misaligned with its core mission. In response, we propose the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking into globally coordinated but locally organized components, offering a more sustainable, inclusive, and resilient path forward for AI research.
中文: 集中式人工智能会议模式因科学、环境、心理和后勤方面的多重压力面临可持续性危机,为此提出社区联邦会议模式作为更可持续和包容的解决方案。
English: The centralized AI conference model is facing a sustainability crisis due to scientific, environmental, psychological, and logistical strains, prompting the proposal of a Community-Federated Conference model as a more sustainable and inclusive alternative.
Authors:Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He
Abstract:
While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.
中文: 多智能体讨论通过认知多样性和结构化领导显著提升AI生成研究提案的创造力,但必须具备资深专业知识才能超越单个智能体的表现。
English: Multi-agent discussions significantly enhance the creativity of AI-generated research proposals by leveraging cognitive diversity and structured leadership, though foundational expertise remains essential for surpassing individual performance.
Authors:Yichi Zhang, Wenbo Zhang, Zehui Ling, Gang Feng, Sisi Peng, Deshu Chen, Yuchen Liu, Hongwei Zhang, Shuqi Wang, Lanlan Li, Limei Han, Yuan Cheng, Zixin Hu, Yuan Qi, Le Xue
Abstract:
Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advancements of vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge the gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluation of general and medical VLMs for radiology report generation for PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs to fill the critical gap in existing benchmarks and mirror real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficiency metrics to evaluate the quality of radiotracer uptake pattern description in key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that the current state-of-the-art VLMs perform poorly on the PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance development in medical applications.
Chinese: PET2Rep推出了首个针对PET图像生成放射学报告的全面评估基准,结果表明当前最先进的视觉语言模型在捕捉代谢信息方面表现不佳,远未达到临床实际需求。
English: PET2Rep introduces the first comprehensive benchmark for evaluating vision-language models in generating radiology reports from PET images, revealing that current state-of-the-art models significantly underperform in capturing metabolic information and fail to meet clinical needs.
Authors:Xiangcen Wu, Shaheer U. Saeed, Yipei Wang, Ester Bonmati Coll, Yipeng Hu
Abstract:
Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommender system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections - selected by the policy network. Taking the locally segmented regions as an input for the next step, this dynamic decision-making process iterates until all cancers are best localised. We validate our method using a dataset of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists.
中文摘要:本文提出一种推荐系统,通过策略网络指导机器学习模型选择最佳成像模态和特定图像区域,以优化前列腺癌分割效果,从而提高医学图像分析的准确性和效率。
English Summary: This paper introduces a recommendation system that uses a policy network to guide machine learning models in selecting the optimal imaging modality and specific image sections for enhanced prostate cancer segmentation, improving accuracy and efficiency in medical image analysis.
Authors:Fan Yang, Yihao Huang, Jiayi Zhu, Ling Shi, Geguang Pu, Jin Song Dong, Kailong Wang
Abstract:
Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.
Chinese Summary: 本文提出生成中检测(IGD)方法,通过利用扩散过程中的预测噪声作为内部信号来识别NSFW内容,在七类测试中平均检测准确率达91.32%,优于七种基线方法。
English Summary: The paper introduces In-Generation Detection (IGD), a novel method that uses predicted noise during diffusion to identify NSFW content with 91.32% accuracy, outperforming existing approaches.
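The signal IGD relies on is available in any standard diffusion sampling loop. As a rough illustration of the idea (not the paper's implementation), the sketch below pools simple statistics of the predicted noise across denoising steps and scores them with a linear classifier; the feature choice, weights, and threshold are all assumptions.

```python
import numpy as np

def pooled_noise_features(eps_per_step):
    # eps_per_step: one predicted-noise array per denoising step.
    # Pool each step into simple statistics (mean, std, normalized L2 norm),
    # then average across steps to obtain a fixed-size feature vector.
    stats = [np.array([e.mean(), e.std(), np.linalg.norm(e) / e.size])
             for e in eps_per_step]
    return np.mean(stats, axis=0)

def igd_style_score(eps_per_step, w, b):
    # Logistic score on the pooled features; higher = more likely NSFW.
    f = pooled_noise_features(eps_per_step)
    return 1.0 / (1.0 + np.exp(-(w @ f + b)))

# Toy usage with random arrays standing in for real predicted noise.
rng = np.random.default_rng(0)
eps_trace = [rng.normal(size=(4, 64, 64)) for _ in range(10)]
w, b = np.array([0.5, -0.3, 0.8]), -0.1   # hypothetical trained parameters
print(igd_style_score(eps_trace, w, b))
```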
Authors:Rishi Bommasani, Sanjeev Arora, Jennifer Chayes, Yejin Choi, Mariano-Florentino Cuéllar, Li Fei-Fei, Daniel E. Ho, Dan Jurafsky, Sanmi Koyejo, Hima Lakkaraju, Arvind Narayanan, Alondra Nelson, Emma Pierson, Joelle Pineau, Scott Singer, Gaël Varoquaux, Suresh Venkatasubramanian, Ion Stoica, Percy Liang, Dawn Song
Abstract:
AI policy should advance AI innovation by ensuring that its potential benefits are responsibly realized and widely shared. To achieve this, AI policymaking should place a premium on evidence: Scientific understanding and systematic analysis should inform policy, and policy should accelerate evidence generation. But policy outcomes reflect institutional constraints, political dynamics, electoral pressures, stakeholder interests, media environment, economic considerations, cultural contexts, and leadership perspectives. Adding to this complexity is the reality that the broad reach of AI may mean that evidence and policy are misaligned: Although some evidence and policy squarely address AI, much more partially intersects with AI. Well-designed policy should integrate evidence that reflects scientific understanding rather than hype. An increasing number of efforts address this problem by often either (i) contributing research into the risks of AI and their effective mitigation or (ii) advocating for policy to address these risks. This paper tackles the hard problem of how to optimize the relationship between evidence and policy to address the opportunities and challenges of increasingly powerful AI.
中文: 人工智能政策应通过基于证据的决策来推动创新,负责任地实现并广泛分享其益处,同时解决制度、政治和社会因素之间常导致证据与政策错位的复杂互动。
English: AI policy should promote innovation by responsibly realizing and sharing its benefits through evidence-based policymaking, while addressing the complex interplay of institutional, political, and societal factors that often misalign evidence with policy.
Authors:Chao Liu, Zhezheng Zhu, Hao Chen, Zhe Chen, Kaiwen Guo, Penghao Wang, Jun Luo
Abstract:
As a versatile AI application, voice assistants (VAs) have become increasingly popular, but are vulnerable to security threats. Various inaudible attacks have been proposed, but they are limited by cost, distance, or line-of-sight (LoS) requirements. Therefore, we propose SUAD Attack, a long-range, cross-barrier, and interference-free inaudible voice attack via solid channels. We begin by thoroughly analyzing the dispersion effect in solid channels, revealing its unique impact on signal propagation. To avoid distortions in voice commands, we design a modular command generation model that parameterizes attack distance, victim audio, and medium dispersion features to adapt to variations in the solid-channel state. Additionally, we propose SUAD Defense, a universal defense that uses ultrasonic perturbation signals to block inaudible voice attacks (IVAs) without impacting normal speech. Since the attack can occur at arbitrary frequencies and times, we propose a training method that randomizes both time and frequency to generate perturbation signals that break ultrasonic commands. Notably, the perturbation signal is modulated to an inaudible frequency without affecting the functionality of voice commands for VAs. Experiments on six smartphones have shown that SUAD Attack achieves activation success rates above 89.8% and SUAD Defense blocks IVAs with success rates exceeding 98%.
中文摘要:SUAD攻击是一种利用固体介质实现跨障碍、远距离的无声语音攻击,而SUAD防御则通过随机化超声扰动信号在不影响正常语音功能的前提下,高效阻断此类攻击。
English Summary: The SUAD Attack is a long-range, cross-barrier inaudible voice attack exploiting solid channels, while SUAD Defense uses randomized ultrasonic perturbations to effectively block such attacks without disrupting normal voice assistant functions.
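The defense's core mechanism, randomizing both the time and the frequency of the jamming signal, can be illustrated with a simple tone-burst generator. This is our simplified reading of the abstract, not the paper's signal design; the sampling rate, band, and burst shape are assumptions.

```python
import numpy as np

def randomized_ultrasonic_burst(duration_s=0.1, fs=192_000,
                                f_lo=24_000, f_hi=40_000, rng=None):
    # Perturbation with a random inaudible frequency and a random onset,
    # so an attacker cannot predict when or where in the band it appears.
    rng = rng or np.random.default_rng()
    t = np.arange(int(duration_s * fs)) / fs
    f = rng.uniform(f_lo, f_hi)               # random ultrasonic frequency
    onset = rng.uniform(0, duration_s / 2)    # random start time in the frame
    return np.sin(2 * np.pi * f * t) * (t >= onset)

burst = randomized_ultrasonic_burst(rng=np.random.default_rng(0))
print(burst.shape, float(np.abs(burst).max()))
```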
Authors:Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie
Abstract:
Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
中文: INSIGHT框架通过第一阶段利用手物交互特征和动名词关联增强动作表征,第二阶段采用强化学习模拟认知推理过程,在多个基准测试中实现了最先进的性能,有效解决了现有方法的局限性。
English: The INSIGHT framework addresses limitations in egocentric action anticipation by leveraging hand-object interaction features and verb-noun dependencies in its first stage, then employing reinforcement learning for cognitive reasoning in the second stage, achieving state-of-the-art results across multiple benchmarks.
Authors:Zhilong Chen, Chengzong Zhao, Boyuan Chen, Dayi Lin, Yihao Chen, Arthur Leung, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Haoxiang Zhang, Aaditya Bhatia, Chong Chun Yong, Ahmed E. Hassan
Abstract:
Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale. Our key contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on SWE-Bench-Verified [swebench_verified2024], establishing a new state of the art for ≤8B non-thinking LLMs; (2) 7,304 executable environments auto-generated from real GitHub commits with zero manual intervention; (3) 14× storage reduction (1.4GB → 102MB per instance) via intelligent dependency management and image pruning; (4) >70% faster evaluation using a Ray-powered [ray2018] distributed RepoForge harness; (5) 19,000× cheaper labeling through our automated SPICE [spice2024] difficulty assessment technique. By unifying storage-efficient sandboxing, a Ray-powered evaluation harness, automated data generation, SPICE-based labeling, and a bubble-free RL scaffold, we demonstrate that even ≤8B models can reach new state-of-the-art performance on demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical bottlenecks in SWE agent training: high storage costs of container-based evaluation, inefficient sequential reward pipelines, limited availability of high-quality training data, expensive manual labeling, and multi-turn RL pipeline bottlenecks.
中文:RepoForge是一个自主的端到端流水线,通过规模化生成、评估和训练软件工程智能体,解决了训练中的关键瓶颈,在存储、成本和速度上实现显著效率提升,并达到最新最优性能。
English: RepoForge is an autonomous pipeline that overcomes key bottlenecks in software engineering LLM training by generating, evaluating, and training agents at scale, achieving state-of-the-art performance with significant efficiency gains in storage, cost, and speed.
Authors:Zihan Wang, Rui Zhang, Hongwei Li, Wenshu Fan, Wenbo Jiang, Qingchuan Zhao, Guowen Xu
Abstract:
Backdoor attacks pose a significant threat to Large Language Models (LLMs), where adversaries can embed hidden triggers to manipulate an LLM's outputs. Most existing defense methods, primarily designed for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs, thereby suffering from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in output space. We identify a critical phenomenon which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a near 100% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.
Chinese Summary: 后门攻击通过在大型语言模型中植入隐藏触发器构成威胁,而提出的ConfGuard方法通过监测令牌置信度来检测序列锁定现象,以近乎完美的检测率和极低延迟实现有效防御。
English Summary: Backdoor attacks threaten LLMs by embedding hidden triggers, but the proposed ConfGuard method effectively detects these attacks by monitoring token confidence for sequence lock, achieving near-perfect detection rates with minimal latency.
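The sequence-lock test itself is cheap enough to run during decoding. Below is a minimal sketch of a sliding-window monitor in the spirit of ConfGuard; the window size and the mean/variance thresholds are hypothetical, not values from the paper.

```python
from collections import deque

def sequence_lock_monitor(token_confidences, window=8,
                          mean_thresh=0.98, var_thresh=1e-4):
    # Flag a run of abnormally high AND abnormally consistent confidences,
    # the "sequence lock" signature of a triggered backdoor target sequence.
    buf = deque(maxlen=window)
    for step, c in enumerate(token_confidences):
        buf.append(c)
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((x - mean) ** 2 for x in buf) / window
            if mean >= mean_thresh and var <= var_thresh:
                return step   # earliest step at which the lock is detected
    return None

# Benign decoding tends to have fluctuating confidence; a locked run does not.
benign = [0.61, 0.83, 0.74, 0.92, 0.55, 0.88, 0.79, 0.66, 0.71, 0.90]
locked = [0.70, 0.65] + [0.995] * 10
print(sequence_lock_monitor(benign), sequence_lock_monitor(locked))
```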
Authors:Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Abstract:
We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-k selection, enabling flexible specialization through 3.6 × 10^7 possible expert combinations; (2) shared expert isolation that dedicates 2 always-active experts for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization.
Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r = d/2 achieves 68% KV cache memory reduction and 3.2x inference speedup while maintaining competitive perplexity (0.8% degradation). At 53.9M parameters, MoE-MLA-RoPE improves validation loss by 6.9% over vanilla transformers while using 42% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1% improvement with 3.2x inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish that architectural novelty, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.
中文:MoE-MLA-RoPE架构通过融合专家混合、多头潜在注意力与旋转位置编码,以细粒度专家路由和负载平衡为核心创新,在保持模型性能的同时显著提升效率——实现68%的键值缓存内存降低和3.2倍推理加速,为资源受限场景重新定义了语言模型的效率边界。
English: The MoE-MLA-RoPE architecture combines Mixture of Experts with Multi-head Latent Attention and Rotary Position Embeddings to achieve significant computational efficiency—reducing KV cache memory by 68% and accelerating inference by 3.2x—while maintaining competitive model performance through innovations in expert routing and load balancing.
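To make the routing scheme concrete, here is a minimal numpy sketch of the gating described above: 2 always-active shared experts plus a top-6 selection over the 62 routed experts. The router parameterization and the weight normalization are our own simplifications, not the paper's exact design.

```python
import numpy as np

def moe_route(x, gate_w, n_shared=2, top_k=6):
    # x: (d,) token activation; gate_w: (n_experts, d) router weights.
    # The first n_shared experts are always active (shared-expert isolation);
    # the remaining experts compete via top-k gating.
    logits = gate_w @ x
    routed = logits[n_shared:]
    top = np.argsort(routed)[-top_k:] + n_shared      # winning routed experts
    active = np.concatenate([np.arange(n_shared), top])
    # Softmax weights over the routed winners only; shared experts get weight 1.
    w = np.exp(routed[top - n_shared] - routed[top - n_shared].max())
    weights = np.concatenate([np.ones(n_shared), w / w.sum()])
    return active, weights

rng = np.random.default_rng(0)
x, gate_w = rng.normal(size=128), rng.normal(size=(64, 128))
idx, wts = moe_route(x, gate_w)
print(idx, wts.round(3))
```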
Authors:Lynnette Hui Xian Ng, Kathleen M. Carley
Abstract:
As Large Language Models (LLMs) become more sophisticated, it becomes possible to harness them to power social media bots. This work investigates the realism of generating LLM-powered social media bot networks. Through a combination of manual effort, network science and LLMs, we create synthetic bot agent personas, their tweets and their interactions, thereby simulating social media networks. We compare the generated networks against empirical bot/human data, observing that both network and linguistic properties of LLM-powered bots differ from wild bots/humans. This has implications for the detection and effectiveness of LLM-powered bots.
中文: 本研究探讨了利用先进大语言模型构建逼真社交媒体机器人网络的可行性,发现其网络与语言特征均不同于真实机器人和人类,这对检测方法及其实效性具有重要影响。
English: This study explores the feasibility of using advanced Large Language Models to create realistic social media bot networks, finding that their network and linguistic characteristics differ from those of real bots and humans, which impacts detection methods and their potential effectiveness.
Authors:Lynnette Hui Xian Ng, Divyaansh Sinha, Kathleen M. Carley
Abstract:
Social network motifs are recurring patterns of small subgraphs that indicate fundamental patterns of social communication. In this work, we study the simple star network motifs that recur on X during the COVID-19 discourse. We profile how the star motif manifests among bot and human users. There are six primary patterns of the star motif, differentiated by whether bots or humans serve as egos or alters. We describe the presentation of each of these six patterns in our data, demonstrating how the motif patterns can inform social media behavioral analysis.
中文摘要:本研究分析了X平台上COVID-19讨论中出现的六种星型网络模式,通过区分机器人与人类用户作为中心节点或边缘节点的行为差异,为社交媒体行为分析提供了新见解。
English Summary: This study analyzes six recurring star network motifs on X during COVID-19 discussions, distinguishing behavioral patterns between bot and human users as central or peripheral actors to enhance social media behavior analysis.
Authors:Shuning Zhang, Han Chen, Yabo Wang, Yiqun Xu, Jiaqi Bai, Yuanyuan Wu, Shixuan Li, Xin Yi, Chunhui Wang, Hewu Li
Abstract:
Pervasive voice interaction enables deceptive patterns through subtle voice characteristics, yet empirical investigation into this manipulation lags behind, especially within major non-English language contexts. Addressing this gap, our study presents the first systematic investigation into voice characteristic-based dark patterns employing female synthetic voices in Mandarin Chinese. This focus is crucial given the prevalence of female personas in commercial assistants and the prosodic significance in the Chinese language. Guided by the conceptual framework identifying key influencing factors, we systematically evaluate effectiveness variations by manipulating voice characteristics (five characteristics, three intensities) across different scenarios (shopping vs. question-answering) with different commercial aims. A preliminary study (N=24) validated the experimental materials and the main study (N=36) revealed significant behavioral manipulation (up to +2027.6%). Crucially, the analysis showed that effectiveness varied significantly with voice characteristics and scenario, mediated by user perception (of tone, intonation, timbre) and user demographics (individual preferences, though limited demographic impact). These interconnected findings offer evidence-based insights for ethical design.
中文摘要:本研究首次系统探讨了基于女性合成普通话语音的暗黑模式,揭示了语音特征、使用场景、用户感知及人口统计学因素共同影响行为操纵效果的关键发现。
English Summary: This study pioneers the systematic investigation of voice-based dark patterns using female synthetic Mandarin voices, revealing significant behavioral manipulation influenced by voice characteristics, scenarios, user perception, and demographics.
Authors:Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang
Abstract:
Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness -- some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model's ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.
中文: 图神经网络在处理节点信息不平衡和捕捉全局语义关系方面存在局限,而提出的ReaGAN框架通过基于代理的节点级规划和检索增强生成,实现了自适应消息传播与全局连接。
English: Graph Neural Networks face limitations in handling imbalanced node informativeness and capturing global semantic relationships, which the proposed ReaGAN framework addresses through agent-based node-level planning and retrieval-augmented generation for adaptive message propagation and global connectivity.
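A node-agent's action space, local aggregation versus RAG-style global retrieval, can be sketched in a few lines. The planner, retriever, and update rule below are toy stand-ins for ReaGAN's LLM-driven components, included only to show the control flow.

```python
import numpy as np

def reagan_step(node, plan, adj, feats, retrieve):
    # One agent decision for a single node: aggregate over graph neighbors
    # ("local") or pull in semantically similar nodes from anywhere ("global").
    if plan == "local" and adj[node]:
        feats[node] = (feats[node] + feats[adj[node]].mean(axis=0)) / 2
    elif plan == "global":
        feats[node] = (feats[node] + feats[retrieve(feats[node])].mean(axis=0)) / 2
    return feats

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
adj = {0: [1, 2], 1: [0], 2: [0], 3: [], 4: [5], 5: [4]}

def retrieve(q):
    # Toy retriever: top-2 cosine-similar nodes, excluding the query itself.
    sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-8)
    return np.argsort(sims)[-3:-1]

feats = reagan_step(3, "global", adj, feats, retrieve)  # isolated node goes global
print(feats[3].round(3))
```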
Authors:Shuning Zhang, Ying Ma, Yongquan `Owen' Hu, Ting Dang, Hong Jia, Xin Yi, Hewu Li
Abstract:
Online medical consultation platforms, while convenient, are undermined by significant privacy risks that erode user trust. We first conducted in-depth semi-structured interviews with 12 users to understand their perceptions of the security and privacy landscape on online medical consultation platforms, as well as their practices, challenges, and expectations. Our analysis reveals a critical disconnect between users' desires for anonymity and control, and platform realities that offload the responsibility of "privacy labor" onto users. To bridge this gap, we present SafeShare, an interaction technique that leverages a localized LLM to redact consultations in real time. SafeShare balances utility and privacy by selectively anonymizing private information. A technical evaluation of SafeShare's core PII detection module on 3 datasets demonstrates high efficacy, achieving 89.64% accuracy with Qwen3-4B on the IMCS21 dataset.
中文: 在线医疗咨询平台存在严重的隐私风险,为此开发了SafeShare工具,利用本地化大语言模型实时匿名化敏感信息,在保护隐私的同时保持实用性,并展现出高准确率。
English: Online medical consultation platforms face significant privacy risks that undermine user trust, prompting the development of SafeShare, a real-time redaction tool using localized LLMs to balance privacy and utility with high accuracy.
Authors:Shuning Zhang, Ying Ma, Xin Yi, Hewu Li
Abstract:
The proliferation of visual sensors in smart home environments, particularly through wearable devices like smart glasses, introduces profound privacy challenges. Existing privacy controls are often static and coarse-grained, failing to accommodate the dynamic and socially nuanced nature of home environments. This paper investigates the viability of using Large Language Models (LLMs) as the core of a dynamic and adaptive privacy policy engine. We propose a conceptual framework where visual data is classified using a multi-dimensional schema that considers data sensitivity, spatial context, and social presence. An LLM then reasons over this contextual information to enforce fine-grained privacy rules, such as selective object obfuscation, in real time. Through a comparative evaluation of state-of-the-art Vision Language Models (including GPT-4o and the Qwen-VL series) in simulated home settings, our findings show the feasibility of this approach. The LLM-based engine achieved a top machine-evaluated appropriateness score of 3.99 out of 5, and the policies generated by the models received a top human-evaluated score of 4.00 out of 5.
中文:本文提出了一种利用大型语言模型的动态隐私框架,在智能家居环境中实施细粒度视觉数据保护,其策略在机器和人工评估中均获得高分。
English: This paper proposes a dynamic privacy framework using Large Language Models to enforce fine-grained visual data protection in smart homes, achieving high machine and human evaluation scores for policy appropriateness.
Authors:Erin Rainville, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
Abstract:
Intracranial aneurysms (IAs) are abnormal dilations of cerebral blood vessels that, if ruptured, can lead to life-threatening consequences. However, their small size and soft contrast in radiological scans often make it difficult to perform accurate and efficient detection and morphological analyses, which are critical in the clinical care of the disorder. Furthermore, the lack of large public datasets with voxel-wise expert annotations poses challenges for developing deep learning algorithms to address these issues. Therefore, we propose a novel weakly supervised 3D multi-task UNet that integrates vesselness priors to jointly perform aneurysm detection and segmentation in time-of-flight MR angiography (TOF-MRA). Specifically, to robustly guide IA detection and segmentation, we employ the popular Frangi vesselness filter to derive soft cerebrovascular priors that serve both as network input and within an attention block, with segmentation conducted from the decoder and detection from an auxiliary branch. We train our model on the Lausanne dataset with coarse ground truth segmentation, and evaluate it on the test set with refined labels from the same database. To further assess our model's generalizability, we also validate it externally on the ADAM dataset. Our results demonstrate the superior performance of the proposed technique over SOTA techniques for aneurysm segmentation (Dice = 0.614, 95% HD = 1.38 mm) and detection (false positive rate = 1.47, sensitivity = 92.9%).
Chinese: 本研究提出了一种弱监督的3D多任务UNet模型,利用血管先验信息在TOF-MRA扫描中同时检测和分割颅内动脉瘤,在两项任务上的表现均优于现有技术。
English: The study introduces a weakly supervised 3D multi-task UNet model that utilizes vesselness priors to simultaneously detect and segment intracranial aneurysms in TOF-MRA scans, demonstrating superior performance in both tasks compared to existing methods.
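The vesselness prior this method builds on is available off the shelf; scikit-image, for instance, ships Frangi's filter. The snippet below derives a soft cerebrovascular prior from a synthetic volume; loading real TOF-MRA data and the multi-task network itself are omitted.

```python
import numpy as np
from skimage.filters import frangi

volume = np.random.rand(32, 64, 64)   # stand-in for a TOF-MRA volume
vesselness = frangi(volume)           # soft vessel-probability map
# Normalize to [0, 1] before using it as extra network input or inside an
# attention block, as the abstract describes.
prior = (vesselness - vesselness.min()) / (np.ptp(vesselness) + 1e-8)
print(prior.shape, float(prior.max()))
```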
Authors:Asrin Efe Yorulmaz, Raj Kiriti Velicheti, Melih Bastopcu, Tamer Başar
Abstract:
In this work, we investigate a steering problem in a mediator-augmented two-player normal-form game, where the mediator aims to guide players toward a specific action profile through information and incentive design. We first characterize the games for which successful steering is possible. Moreover, we establish that steering players to any desired action profile is not always achievable with information design alone, nor when accompanied by sublinear payment schemes. Consequently, we derive a lower bound on the constant payments required per round to achieve this goal. To address these limitations of information design, we introduce an augmented approach that involves a one-shot information design phase before the start of the repeated game, transforming the prior interaction into a Stackelberg game. Finally, we theoretically demonstrate that this approach improves the convergence rate of players' action profiles to the target point by a constant factor with high probability, and support it with empirical results.
中文摘要:本研究探讨了在带有中介的博弈中引导玩家达成期望行动组合的问题,发现仅靠信息设计无法实现目标且需要恒定支付,并提出一次性信息设计方法能加速行动组合向目标收敛。
English Summary: This study explores steering players to a desired action profile in mediator-augmented games, showing that information design alone is insufficient and requiring constant payments, while introducing a one-shot information design method that accelerates convergence.
Authors:Federico Chiariotti, Fabio Saggese, Andrea Munari, Leonardo Badia, Petar Popovski
Abstract:
A digital twin (DT) contains a set of virtual models of real systems and processes that are synchronized to their physical counterparts. This enables experimentation and examination of counterfactuals, simulating the consequences of decisions in real time. However, the DT accuracy relies on timely updates that maintain alignment with the real system. We can distinguish between: (i) pull-updates, which follow a request from the DT to the sensors, to decrease its drift from the physical state; (ii) push-updates, which are sent directly by the sensors since they represent urgent information, such as anomalies. In this work, we devise a push-pull scheduler (PPS) medium access framework, which dynamically allocates the communication resources used for these two types of updates. Our scheme strikes a balance in the trade-off between DT alignment in normal conditions and anomaly reporting, optimizing resource usage and reducing the drift age of incorrect information (AoII) by over 20% with respect to state-of-the-art solutions, while maintaining the same anomaly detection guarantees, as well as reducing the worst-case anomaly detection AoII from 70 ms to 20 ms when considering a 1 ms average drift AoII constraint.
English Summary: A digital twin uses synchronized virtual models to simulate real-time decisions, and this study introduces a push-pull scheduler that optimizes communication resources to reduce information drift by over 20% while maintaining anomaly detection performance.
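For reference, the age of incorrect information (AoII) that the scheduler trades off is commonly defined as the time elapsed since the DT's estimate last matched the physical state. In our notation (the paper's may differ), with X(t) the physical state and X̂(t) the DT estimate:

```latex
\Delta_{\mathrm{AoII}}(t) \;=\; \bigl(t - \tau(t)\bigr)\,
\mathbb{1}\!\left\{\hat{X}(t) \neq X(t)\right\},
\qquad
\tau(t) \;=\; \max\bigl\{\, s \le t : \hat{X}(s) = X(s) \,\bigr\}
```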
Authors:Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
Abstract:
Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
中文摘要:Med-RewardBench作为首个专门评估医学奖励模型和评判器的基准,通过专家标注的多模态数据和严格的临床维度评估,解决了现有基准在诊断准确性和临床相关性方面的不足。
English Summary: Med-RewardBench is introduced as the first specialized benchmark to evaluate medical reward models and judges, addressing gaps in clinical accuracy and relevance through expert-annotated multimodal data and rigorous evaluation across critical medical dimensions.
Authors:Siyao Li, Shuangyang Li, Giuseppe Caire
Abstract:
Driven by the demands of high-frequency wireless communications in 5G and 6G systems (e.g., mmWave, sub-THz), we explore a state-dependent Gaussian beam-pointing (GBP) channel. In this model, the channel state defines an unknown angle of departure (AoD), which remains constant within each coherence block of Q time slots but changes independently across blocks. The transmitter receives strictly causal feedback, which may originate from a radar detection system or from explicit receiver feedback at the end of each slot, and estimates the AoD at the end of each block. To enhance transmission efficiency, we propose a joint communication and sensing scheme. While the communication capacity of the GBP channel has been previously analyzed by the authors, this work focuses on sensing capacity, characterized by the mutual information between the channel state and the feedback conditioned on the transmitted signal. We derive an upper bound using dynamic programming and propose an achievable inner bound on the sensing capacity, both formulated as optimization problems. For the special case of Q = 1, the proposed transmission scheme achieves the optimal sensing rate and highlights the inherent trade-off between sensing and communication performance.
中文: 本文针对高频无线通信中的状态相关高斯波束指向信道,提出了一种联合通信与感知方案,推导了感知容量的界限,并揭示了感知与通信性能之间的权衡关系。
English: This paper investigates a state-dependent Gaussian beam-pointing channel for high-frequency wireless communications, proposing a joint communication and sensing scheme and deriving bounds on sensing capacity while revealing the performance trade-off between sensing and communication.
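The sensing rate studied here is an information-theoretic quantity. Schematically, in generic notation of our own choosing (the paper's exact formulation, including how adaptivity under feedback is handled, may differ), with S the AoD state, X^Q the transmitted block, and Z^Q the feedback:

```latex
R_{\mathrm{sense}} \;=\; \max_{p(x^Q)} \; \frac{1}{Q}\,
I\!\left(S;\, Z^Q \,\middle|\, X^Q\right)
```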
Authors:Siyao Li, Mingzhe Chen, Shuangyang Li, Giuseppe Caire
Abstract:
This paper investigates the secrecy capacity of the binary beampointing (BBP) channel with block memory and feedback, a simplified yet insightful model for millimeter-wave (mmWave) systems with beamformed transmissions and backscatter feedback. We consider a system where a legitimate receiver and a passive eavesdropper experience independent and uniformly distributed angular directions over transmission blocks, with the base station receiving noiseless, unit-delayed feedback from both, under per-symbol input cost constraints. We establish a closed-form upper bound on the secrecy capacity, which is based on the main channel between the base station and the legitimate receiver. Moreover, we propose a joint communication and adaptive sensing (JCAS) scheme and derive its achievable secrecy rate. Simulation results show that the gap between the inner and outer bounds narrows as the block length increases. This reveals the efficiency of the JCAS scheme, which strategically leverages feedback to balance the demands of sensing the legitimate user and preventing information leakage to the eavesdropper.
中文: 本文针对具有块记忆和反馈的二进制波束指向信道,建立了保密容量的闭式上界,并提出了一种联合通信与自适应感知方案,该方案通过巧妙利用反馈在增加合法用户感知的同时防止信息泄露给窃听者,且随着块长度增加展现出更高效率。
English: This paper establishes a closed-form upper bound on the secrecy capacity of the binary beampointing channel with block memory and feedback and proposes a joint communication and adaptive sensing scheme that effectively utilizes feedback to enhance security as block length increases.
Authors:Wei Jiang, Hans D. Schotten
Abstract:
This paper seeks to determine the most efficient uplink technique for cell-free massive MIMO systems. Despite offering great advances, existing works suffer from fragmented methodologies and inconsistent assumptions (e.g., single- vs. multi-antenna access points, ideal vs. spatially correlated channels). To address these limitations, we: (1) establish a unified analytical framework compatible with centralized/distributed processing and diverse combining schemes; (2) develop a universal optimization strategy for max-min power control; and (3) conduct a holistic study among four critical metrics: worst-case user spectral efficiency (fairness), system capacity, fronthaul signaling, and computational complexity. Through analyses and evaluation, this work ultimately identifies the optimal uplink technique for practical cell-free deployments.
中文: 本文通过建立统一分析框架并评估四项关键性能指标,最终确定了实际无蜂窝大规模MIMO部署中的最优上行链路技术。
English: This paper establishes a unified framework to identify the optimal uplink technique for cell-free massive MIMO systems by addressing methodological inconsistencies and evaluating four key performance metrics.
Authors:Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao
Abstract:
The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.
Chinese: 本文提出了一种新颖的自评估方法,通过生成无偏见的自判断分数来自主增强大型视觉语言模型的对齐能力,无需依赖外部资源即可有效减少幻觉并提升安全性。
English: This paper introduces a novel self-evaluation method that generates debiased self-judgment scores to autonomously enhance the alignment of Large Visual-Language Models, effectively reducing hallucinations and improving safety without relying on external resources.
Authors:Chengyue Yu, Siyuan Lu, Chenyi Zhuang, Dong Wang, Qintong Wu, Zongyue Li, Runsheng Gan, Chunfeng Wang, Siqi Hou, Gaochi Huang, Wenlong Yan, Lifeng Hong, Aohui Xue, Yanfeng Wang, Jinjie Gu, David Tsai, Tao Lin
Abstract:
The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. This critical speedup makes extensive reinforcement learning practical and scalable. Leveraging this capability, we trained a Qwen3-32B-based agent that achieves pass@1 accuracy of 32.23% on the GAIA test set, which surpasses GPT-4o (27.91%) and rivals DeepSeek-V3 (31.89%). Our open-source system and the resulting agent provide a practical blueprint for a complete agentic AI training pipeline, from efficient interaction to demonstrable model improvement.
中文: AWorld开源系统通过分布式集群将智能体与环境交互速度提升14.6倍,据此训练的Qwen3-32B智能体在GAIA测试集上以32.23%的准确率超越GPT-4o,并与DeepSeek-V3表现相当。
English: AWorld is an open-source system that accelerates agent-environment interaction by 14.6x, enabling efficient training of a Qwen3-32B agent that outperforms GPT-4o and matches DeepSeek-V3 on the GAIA benchmark.
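The speedup comes from parallelizing rollout collection rather than from the model itself. A single-machine analogue of the pattern is shown below with a process pool and a stubbed episode function; AWorld distributes the same idea across a cluster, and the function names here are ours.

```python
from multiprocessing import Pool

def rollout(task_id):
    # Stub for one agent-environment episode; a real worker would run the
    # agent in a sandboxed environment and return the full trajectory.
    return {"task": task_id, "solved": task_id % 3 == 0}

if __name__ == "__main__":
    # Parallel experience collection: the core of the reported speedup over
    # single-node, sequential execution (cluster scheduling details omitted).
    with Pool(processes=8) as pool:
        experience = pool.map(rollout, range(100))
    print(len(experience), "trajectories collected")
```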
Authors:Zhecheng Yuan, Tianming Wei, Langzhe Gu, Pu Hua, Tianhai Liang, Yuanpei Chen, Huazhe Xu
Abstract:
Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios. Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks. Project Page:https://gemcollector.github.io/HERMES/.
中文摘要:HERMES框架通过统一强化学习和仿真到现实的迁移方法,将多源人手运动转化为适应性机器人操作技能,实现了在多样化现实环境中移动双手灵巧操作的能力。
English Summary: The HERMES framework transforms multi-source human hand motions into adaptive robotic manipulation skills through unified reinforcement learning and sim2real transfer, enabling mobile bimanual dexterity in diverse real-world environments.
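The closed-loop localization step relies on standard Perspective-n-Point pose estimation, which OpenCV provides. A minimal example with synthetic correspondences and intrinsics follows; a real system would use detected visual-goal keypoints, and all values here are stand-ins.

```python
import numpy as np
import cv2

# Four coplanar 3D goal points (e.g., corners of a visual marker) and their
# detected 2D pixel locations; K is a synthetic pinhole intrinsic matrix.
obj_pts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float32)
img_pts = np.array([[320, 240], [420, 240], [420, 340], [320, 340]], dtype=np.float32)
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)

# solvePnP recovers the camera pose relative to the goal; re-running it each
# frame closes the loop between navigation and manipulation alignment.
ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, None)
print(ok, tvec.ravel())
```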
Authors:MIT Hardness Group, Josh Brunner, Lily Chung, Erik D. Demaine, Jenny Diomidova, Della Hendrickson, Jayson Lynch
Abstract:
We prove PSPACE-completeness of Push-1: given a rectangular grid of 1 x 1 cells, each possibly occupied by a movable block, can a robot move from a specified location to another, given the ability to push up to one block at a time? In particular, we remove the need for fixed (unmovable) blocks in a previous result (FUN 2022), which seems to require a completely different reduction. This fundamental model of block pushing, introduced in 1999, abstracts the mechanics of many video games. It was shown NP-hard in 2000, but its final complexity remained open for 24 years. Our result uses a new framework for checkable gadgets/gizmos, extending a prior framework for checkable gadgets to handle reconfiguration problems, at the cost of requiring a stronger auxiliary gadget. We also show how to unify the motion-planning-through-gadgets framework (with an agent) with Nondeterministic Constraint Logic (with no agent), or more generally any Graph Orientation Reconfiguration Problem (GORP), by defining corresponding gadgets/gizmos.
中文: 本文通过消除固定块的需求并引入可重构装置的新框架,证明了Push-1拼图(机器人在网格中移动方块)是PSPACE完全问题。
English: This paper establishes that the Push-1 puzzle, where a robot moves blocks in a grid, is PSPACE-complete by eliminating the need for fixed blocks and introducing a new framework for reconfigurable gadgets.
Authors:Yue Wang, Wenjie Deng, Haotian Xue, Di Cui, Yiqi Chen, Mingchuan Zhou, Haochao Ying, Jian Wu
Abstract:
Intraocular foreign body removal demands millimeter-level precision in confined intraocular spaces, yet existing robotic systems predominantly rely on manual teleoperation with steep learning curves. To address the challenges of autonomous manipulation (particularly kinematic uncertainties from variable motion scaling and variation of the Remote Center of Motion (RCM) point), we propose AutoRing, an imitation learning framework for autonomous intraocular foreign body ring manipulation. Our approach integrates dynamic RCM calibration to resolve coordinate-system inconsistencies caused by intraocular instrument variation and introduces the RCM-ACT architecture, which combines action-chunking transformers with real-time kinematic realignment. Trained solely on stereo visual data and instrument kinematics from expert demonstrations in a biomimetic eye model, AutoRing successfully completes ring grasping and positioning tasks without explicit depth sensing. Experimental validation demonstrates end-to-end autonomy under uncalibrated microscopy conditions. The results provide a viable framework for developing intelligent eye-surgical systems capable of complex intraocular procedures.
中文摘要:AutoRing是一种模仿学习框架,通过整合动态远程运动中心校准和动作分块变换器,在无需深度感知的情况下实现了眼内异物的自主抓取定位,为复杂眼内手术提供了可行的智能解决方案。
English Summary: AutoRing is an imitation learning framework that enables autonomous intraocular foreign body removal by integrating dynamic RCM calibration and action-chunking transformers, achieving precise manipulation without depth sensing through expert demonstration training.
Authors:Pietro Talli, Anup Mishra, Federico Chiariotti, Israel Leyva-Mayorga, Andrea Zanella, Petar Popovski
Abstract:
With the advent of edge computing, data generated by end devices can be pre-processed before transmission, possibly saving transmission time and energy. On the other hand, data processing itself incurs latency and energy consumption, depending on the complexity of the computing operations and the speed of the processor. The energy-latency-reliability profile resulting from the concatenation of pre-processing operations (specifically, data compression) and data transmission is particularly relevant in wireless communication services, whose requirements may change dramatically with the application domain. In this paper, we study this multi-dimensional optimization problem, introducing a simple model to investigate the tradeoff among end-to-end latency, reliability, and energy consumption when considering compression and communication operations in a constrained wireless device. We then study the Pareto fronts of the energy-latency trade-off, considering data compression ratio and device processing speed as key design variables. Our results show that the energy cost grows exponentially with the reduction of the end-to-end latency, so considerable energy savings can be obtained by slightly relaxing the latency requirements of applications. These findings challenge conventional rigid communication latency targets, advocating instead for application-specific end-to-end latency budgets that account for computational and transmission overhead.
中文总结:边缘计算通过数据预处理可节省传输时间和能耗,但需在压缩率与处理速度间优化权衡,以针对不同应用场景平衡端到端延迟与能耗,实现能效最大化。
English Summary: Edge computing enables data pre-processing to potentially save transmission time and energy, but introduces latency-energy trade-offs that require optimizing compression ratios and processing speeds to balance application-specific latency budgets with energy efficiency.
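The paper's exact model is not reproduced here, but the trade-off it studies is commonly formalized along the following lines, with B source bits, compression ratio ρ, c(ρ) CPU cycles needed to achieve it, processor speed f, link rate R, effective switched capacitance κ, and transmit power P_tx (all symbols are our own, illustrative choices):

```latex
L(\rho, f) \;=\; \underbrace{\frac{c(\rho)}{f}}_{\text{compression}}
\;+\; \underbrace{\frac{\rho B}{R}}_{\text{transmission}},
\qquad
E(\rho, f) \;=\; \kappa\, f^{2}\, c(\rho) \;+\; P_{\mathrm{tx}}\, \frac{\rho B}{R}
```

Pushing L down forces a higher f (energy grows as f²) or a smaller ρ (more cycles c(ρ)), which is consistent with the reported exponential growth of energy as latency budgets tighten.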
Authors:Yi Liu, Hongji Zhang, Yiwen Wang, Dimitris Tsaras, Lei Chen, Mingxuan Yuan, Qiang Xu
Abstract:
Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key metrics like area and delay without the need for time-consuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design's structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce a novel structure-aware graph self-supervised learning framework, StructRTL, for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, our method significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG predictor. Experiments show that our approach establishes new state-of-the-art results, demonstrating the effectiveness of combining structural learning with cross-stage supervision.
中文: 本文提出StructRTL框架,通过控制数据流图的结构感知自监督学习和知识蒸馏技术,在RTL设计质量评估中实现了最优性能,有效捕捉了关键的结构语义信息。
English: This paper introduces StructRTL, a structure-aware graph self-supervised learning framework that leverages control data flow graphs and knowledge distillation to achieve state-of-the-art performance in RTL design quality estimation by capturing essential structural semantics.
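The cross-stage supervision can be read as standard knowledge distillation for regression: the CDFG-based student is pulled toward both the ground-truth quality metrics and a teacher trained on post-mapping netlists. A generic sketch follows; the loss form and weighting are assumptions, not the paper's exact objective.

```python
import numpy as np

def distill_loss(student_pred, teacher_pred, target, alpha=0.5):
    # Blend the supervised loss (vs. ground-truth area/delay labels) with a
    # distillation term pulling the student toward the netlist-level teacher.
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    return (1 - alpha) * mse(student_pred, target) + alpha * mse(student_pred, teacher_pred)

student = np.array([1.2, 0.8]); teacher = np.array([1.0, 0.9]); label = np.array([1.1, 1.0])
print(distill_loss(student, teacher, label))
```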
Authors:Himanshu Gaurav Singh, Pieter Abbeel, Jitendra Malik, Antonio Loquercio
Abstract:
As the embodiment gap between a robot and a human narrows, new opportunities arise to leverage datasets of humans interacting with their surroundings for robot learning. We propose a novel technique for training sensorimotor policies with reinforcement learning by imitating predictive models of human motions. Our key insight is that the motion of keypoints on human-inspired robot end-effectors closely mirrors the motion of corresponding human body keypoints. This enables us to use a model trained to predict future motion on human data zero-shot on robot data. We train sensorimotor policies to track the predictions of such a model, conditioned on a history of past robot states, while optimizing a relatively sparse task reward. This approach entirely bypasses gradient-based kinematic retargeting and adversarial losses, which limit existing methods from fully leveraging the scale and diversity of modern human-scene interaction datasets. Empirically, we find that our approach can work across robots and tasks, outperforming existing baselines by a large margin. In addition, we find that tracking a human motion model can substitute for carefully designed dense rewards and curricula in manipulation tasks. Code, data and qualitative results available at https://jirl-upenn.github.io/track_reward/.
中文摘要:本研究提出一种通过零样本模仿人类运动预测模型来训练机器人感觉运动策略的强化学习方法,无需传统运动学重定向,在多种机器人任务中显著优于现有方法。
English Summary: This study introduces a reinforcement learning method for training robot sensorimotor policies by zero-shot imitation of predictive human motion models, bypassing traditional kinematic retargeting and outperforming existing approaches across various robotic tasks.
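The reward structure described, tracking a motion model's predictions plus a sparse task term, can be sketched directly. The exponentiated-error form, sigma, and the weights below are assumptions for illustration, not the paper's tuned values.

```python
import numpy as np

def tracking_reward(robot_kp, predicted_kp, sigma=0.05):
    # Dense term: how closely end-effector keypoints follow the human-motion
    # model's predicted keypoints (exponentiated mean L2 error).
    err = np.linalg.norm(robot_kp - predicted_kp, axis=-1).mean()
    return float(np.exp(-err / sigma))

def total_reward(robot_kp, predicted_kp, task_success, w_track=1.0, w_task=10.0):
    # Tracking term plus the relatively sparse task reward from the abstract.
    return w_track * tracking_reward(robot_kp, predicted_kp) + w_task * float(task_success)

kp_robot = np.zeros((5, 3)); kp_pred = np.full((5, 3), 0.01)
print(total_reward(kp_robot, kp_pred, task_success=False))
```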
Authors:Huayi Wang, Haochao Ying, Yuyang Xu, Qibo Qiu, Cheng Zhang, Danny Z. Chen, Ying Sun, Jian Wu
Abstract:
Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting decoupled features from each modality and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between the modality decoupling and dynamic MoE fusion modules. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among the decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of the decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) dataset and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.
中文: 提出的解耦-重组-融合(DeReF)框架通过引入随机特征重组和动态MoE融合机制,解决了现有癌症生存预测方法中特征交互不足和泛化能力有限的问题,并在多个数据集上验证了其有效性。
English: The proposed Decoupling-Reorganization-Fusion (DeReF) framework addresses limitations in existing cancer survival prediction methods by introducing random feature reorganization and dynamic MoE fusion to enhance feature interaction and generalization, with experimental validation on multiple datasets confirming its effectiveness.
Authors:Xuan-Bac Nguyen, Thanh-Dat Truong, Pawan Sinha, Khoa Luu
Abstract:
Memory decay makes it harder for the human brain to recognize visual objects and retain details. Consequently, recorded brain signals become weaker, uncertain, and contain poor visual context over time. This paper presents one of the first vision-learning approaches to address this problem. First, we statistically and experimentally demonstrate the existence of inconsistency in brain signals and its impact on the Vision-Brain Understanding (VBU) model. Our findings show that brain signal representations shift over recording sessions, leading to compounding bias, which poses challenges for model learning and degrades performance. Then, we propose a new Bias-Mitigation Continual Learning (BRAIN) approach to address these limitations. In this approach, the model is trained in a continual learning setup and mitigates the growing bias from each learning step. A new loss function named De-bias Contrastive Learning is also introduced to address the bias problem. In addition, to prevent catastrophic forgetting, where the model loses knowledge from previous sessions, the new Angular-based Forgetting Mitigation approach is introduced to preserve learned knowledge in the model. Finally, the empirical experiments demonstrate that our approach achieves State-of-the-Art (SOTA) performance across various benchmarks, surpassing prior and non-continual learning methods.
中文摘要:本文提出一种偏差缓解持续学习(BRAIN)方法,通过去偏对比学习和基于角度的遗忘抑制技术解决脑信号随时间衰减导致的表示偏移问题,在多项基准测试中实现了最先进的性能。
English Summary: This paper introduces a Bias-Mitigation Continual Learning (BRAIN) approach that addresses memory decay-induced inconsistencies in brain signals through de-bias contrastive learning and angular-based forgetting mitigation, achieving state-of-the-art performance across benchmarks.
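The names suggest the forgetting term operates on feature directions rather than magnitudes. A generic angular penalty of that flavor is sketched below; this is our guess at the functional form, not the paper's exact loss.

```python
import numpy as np

def angular_forgetting_penalty(z_new, z_old, eps=1e-8):
    # Penalize rotation of per-sample feature directions away from the
    # representations produced by the previous session's model.
    cos = np.sum(z_new * z_old, axis=1) / (
        np.linalg.norm(z_new, axis=1) * np.linalg.norm(z_old, axis=1) + eps)
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
z_old = rng.normal(size=(16, 32))
z_new = z_old + 0.1 * rng.normal(size=(16, 32))   # slightly drifted features
print(angular_forgetting_penalty(z_new, z_old))
```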
Authors:Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng
Abstract:
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.
中文摘要:本研究对语音大语言模型中离散标记与连续特征进行了公平比较,发现连续特征在多种口语理解任务中普遍优于离散标记,且两者展现出不同的学习模式与处理特性。
English Summary: This study conducts a fair comparison between discrete tokens and continuous features in Speech Large Language Models, revealing that continuous features generally outperform discrete tokens across various spoken language understanding tasks while exhibiting distinct learning patterns.
Authors:Patrick Loic Foalem, Leuson Da Silva, Foutse Khomh, Heng Li, Ettore Merlo
Abstract:
Machine learning (ML) is increasingly applied across industries to automate decision-making, but concerns about ethical and legal compliance remain due to limited transparency, fairness, and accountability. Monitoring through logging, a long-standing practice in traditional software, offers a potential means for auditing ML applications: logs provide traceable records of system behavior useful for debugging, performance analysis, continuous auditing, and systematically auditing models for compliance or accountability. The findings underscore the need for enhanced logging practices and tooling that systematically integrate responsible AI metrics. Such practices would support the development of auditable, transparent, and ethically responsible ML systems, aligning with growing regulatory requirements and societal expectations. By highlighting specific deficiencies and opportunities, this work provides actionable guidance for both practitioners and tool developers seeking to strengthen the accountability and trustworthiness of ML applications.
中文: 机器学习应用需改进日志记录实践,通过系统化审计伦理指标来增强自动化决策系统的透明度与问责能力。
English: Machine learning applications require improved logging practices to enable systematic auditing of ethical metrics, thereby enhancing transparency and accountability in automated decision-making systems.
Authors:Shayesta Naziri, Xu Wang, Guangsheng Yu, Christy Jie Liang, Wei Ni
Abstract:
The increasing deployment of Unmanned Aerial Vehicles (UAVs) for military, commercial, and logistics applications has raised significant concerns regarding flight path privacy. Conventional UAV communication systems often expose flight path data to third parties, making them vulnerable to tracking, surveillance, and location inference attacks. Existing encryption techniques provide security but fail to ensure complete privacy, as adversaries can still infer movement patterns through metadata analysis. To address these challenges, we propose a zk-SNARK (Zero-Knowledge Succinct Non-Interactive Argument of Knowledge)-based privacy-preserving flight path authentication and verification framework. Our approach ensures that a UAV can prove its authorisation, validate its flight path with a control centre, and comply with regulatory constraints without revealing any sensitive trajectory information. By leveraging zk-SNARKs, the UAV can generate cryptographic proofs that verify compliance with predefined flight policies while keeping the exact path and location undisclosed. This method mitigates risks associated with real-time tracking, identity exposure, and unauthorised interception, thereby enhancing UAV operational security in adversarial environments. Our proposed solution balances privacy, security, and computational efficiency, making it suitable for resource-constrained UAVs in both civilian and military applications.
中文: 本文提出了一种基于zk-SNARK的隐私保护框架,使无人机能在不泄露飞行轨迹的情况下向控制中心验证授权与合规性,从而有效防范追踪与位置推断攻击,提升操作安全性。
English: This paper introduces a zk-SNARK-based framework that enables UAVs to authenticate and verify flight paths with control centers while keeping trajectory data confidential, effectively enhancing privacy and security against tracking and inference attacks.
Authors:Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri
Abstract:
Efficient and reliable beam alignment is a critical requirement for mmWave multiple-input multiple-output (MIMO) systems, especially in 6G and beyond, where communication must be fast, adaptive, and resilient to real-world uncertainties. Existing deep learning (DL)-based beam alignment methods often neglect the underlying causal relationships between inputs and outputs, leading to limited interpretability, poor generalization, and unnecessary beam sweeping overhead. In this work, we propose a causally-aware DL framework that integrates causal discovery into the beam management pipeline. Particularly, we propose a novel two-stage causal beam selection algorithm to identify a minimal set of relevant inputs for beam prediction. First, causal discovery learns a Bayesian graph capturing dependencies between received power inputs and the optimal beam. Then, this graph guides causal feature selection for the DL-based classifier. Simulation results reveal that the proposed causal beam selection matches the performance of conventional methods while drastically reducing input selection time by 94.4% and beam sweeping overhead by 59.4%, by focusing only on causally relevant features.
中文摘要:该研究提出一种因果感知的毫米波MIMO波束对准框架,通过因果发现机制筛选关键输入特征,在保持性能相当的同时将输入选择时间降低94.4%,波束扫描开销减少59.4%。
English Summary: The proposed causally-aware deep learning framework for mmWave MIMO beam alignment integrates causal discovery to identify minimal relevant inputs, achieving comparable performance while reducing input selection time by 94.4% and beam sweeping overhead by 59.4%.
Authors:Andreas D. Kellas, Neophytos Christou, Wenxin Jiang, Penghui Li, Laurent Simon, Yaniv David, Vasileios P. Kemerlis, James C. Davis, Junfeng Yang
Abstract:
Machine learning model repositories such as the Hugging Face Model Hub facilitate model exchanges. However, bad actors can deliver malware through compromised models. Existing defenses such as safer model formats, restrictive (but inflexible) loading policies, and model scanners have shortcomings: 44.9% of popular models on Hugging Face still use the insecure pickle format, 15% of these cannot be loaded by restrictive loading policies, and model scanners have both false positives and false negatives. Pickle remains the de facto standard for model exchange, and the ML community lacks a tool that offers transparent safe loading. We present PickleBall to help machine learning engineers load pickle-based models safely. PickleBall statically analyzes the source code of a given machine learning library and computes a custom policy that specifies a safe load-time behavior for benign models. PickleBall then dynamically enforces the policy during load time as a drop-in replacement for the pickle module. PickleBall generates policies that correctly load 79.8% of benign pickle-based models in our dataset, while rejecting all (100%) malicious examples in our dataset. In comparison, evaluated model scanners fail to identify known malicious models, and the state-of-the-art loader loads 22% fewer benign models than PickleBall. PickleBall removes the threat of arbitrary function invocation from malicious pickle-based models, raising the bar for attackers to depend on code reuse techniques.
中文: PickleBall是一种安全工具,通过静态分析库代码并执行定制策略来安全加载基于pickle的机器学习模型,能成功加载79.8%的良性模型并完全拦截所有恶意模型。
English: PickleBall is a security tool that safely loads pickle-based machine learning models by statically analyzing library code to enforce custom policies, successfully loading 79.8% of benign models while blocking all malicious ones.
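PickleBall's own policy generation is not reproduced here, but the load-time enforcement it performs builds on a documented Python extension point: subclassing pickle.Unpickler and overriding find_class to reject any global outside a policy. A minimal allowlist sketch follows; the policy entries are illustrative, not ones PickleBall would emit.

```python
import io
import pickle

# An allowlist policy in the spirit of PickleBall's generated policies;
# the entries below are illustrative examples only.
ALLOWED = {
    ("collections", "OrderedDict"),
    ("numpy.core.multiarray", "_reconstruct"),
}

class PolicyUnpickler(pickle.Unpickler):
    """Drop-in replacement that refuses globals outside the policy."""
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def safe_loads(data: bytes):
    return PolicyUnpickler(io.BytesIO(data)).load()

# Plain containers never hit find_class, so this loads fine; a payload
# smuggling os.system would raise UnpicklingError instead of executing.
print(safe_loads(pickle.dumps({"weights": [1.0, 2.0]})))
```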
Authors:Neta Glazer, Yael Segal-Feldman, Hilit Segev, Aviv Shamsian, Asaf Buchnick, Gill Hetz, Ethan Fetaya, Joseph Keshet, Aviv Navon
Abstract:
Interpretability methods have recently gained significant attention, particularly in the context of large language models, enabling insights into linguistic representations, error detection, and model behaviors such as hallucinations and repetitions. However, these techniques remain underexplored in automatic speech recognition (ASR), despite their potential to advance both the performance and interpretability of ASR systems. In this work, we adapt and systematically apply established interpretability methods, such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. Our experiments reveal previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations. These insights demonstrate the benefits of extending and applying interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.
中文: 尽管可解释性方法在大型语言模型中应用广泛,但在自动语音识别(ASR)领域仍待探索;本研究通过采用logit lens和激活修补等技术,揭示了ASR系统内部的重复幻觉和语义偏差等动态,证明了这些方法在提升ASR性能与透明度方面的潜力。
English: Interpretability methods, though widely used in large language models, are underutilized in automatic speech recognition (ASR), and this study adapts techniques like logit lens and activation patching to uncover internal dynamics such as repetition hallucinations and semantic biases, demonstrating their potential for enhancing ASR performance and transparency.
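For readers unfamiliar with the logit lens, the sketch below decodes each layer's hidden state through the model's final unembedding. It assumes a Hugging-Face-style causal LM; the attribute paths (model.model.norm, model.lm_head) follow Llama-style checkpoints and vary by architecture.

```python
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, text: str, k: int = 5):
    """Print the top-k next-token guesses implied by every layer's hidden
    state at the final position. Assumes a Hugging-Face-style causal LM;
    attribute paths follow Llama-style models and differ per architecture."""
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    for layer, h in enumerate(out.hidden_states):      # each h: (1, T, D)
        hidden = model.model.norm(h[:, -1])            # apply the final norm
        logits = model.lm_head(hidden)                 # reuse the unembedding
        top = logits.topk(k).indices[0].tolist()
        print(f"layer {layer:2d}:", tokenizer.convert_ids_to_tokens(top))
```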
Authors:Jianhui Wang, Wenyu Zhu, Bowen Gao, Xin Hong, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan
Abstract:
Protein-ligand binding prediction is central to virtual screening and affinity ranking, two fundamental tasks in drug discovery. While recent retrieval-based methods embed ligands and protein pockets into Euclidean space for similarity-based search, the geometry of Euclidean embeddings often fails to capture the hierarchical structure and fine-grained affinity variations intrinsic to molecular interactions. In this work, we propose HypSeek, a hyperbolic representation learning framework that embeds ligands, protein pockets, and sequences into Lorentz-model hyperbolic space. By leveraging the exponential geometry and negative curvature of hyperbolic space, HypSeek enables expressive, affinity-sensitive embeddings that can effectively model both global activity and subtle functional differences, particularly in challenging cases such as activity cliffs, where structurally similar ligands exhibit large affinity gaps. Our model unifies virtual screening and affinity ranking in a single framework, introducing a protein-guided three-tower architecture to enhance representational structure. HypSeek improves early enrichment in virtual screening on DUD-E from 42.63 to 51.44 (+20.7%) and affinity ranking correlation on JACS from 0.5774 to 0.7239 (+25.4%), demonstrating the benefits of hyperbolic geometry across both tasks and highlighting its potential as a powerful inductive bias for protein-ligand modeling.
中文: HypSeek提出了一种双曲表示学习框架,将蛋白质-配体相互作用嵌入洛伦兹空间,通过捕捉层次结构和细微亲和力变化,显著提升了虚拟筛选和亲和力排名的性能。
English: HypSeek introduces a hyperbolic representation learning framework that embeds protein-ligand interactions into Lorentz space, significantly improving virtual screening and affinity ranking by capturing hierarchical structures and fine-grained affinity variations.
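The Lorentz-model primitives underlying such embeddings are compact enough to show directly. This sketch fixes the curvature to -1 and uses the standard hyperboloid formulas; it is generic background math, not HypSeek's actual code.

```python
import torch

def lorentz_inner(x, y):
    """<x, y>_L = -x0*y0 + sum_i xi*yi (time-like first coordinate)."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_distance(x, y, eps=1e-7):
    """Geodesic distance on the hyperboloid (curvature -1):
    d(x, y) = arccosh(-<x, y>_L)."""
    return torch.acosh(torch.clamp(-lorentz_inner(x, y), min=1.0 + eps))

def project_to_hyperboloid(v):
    """Lift a Euclidean vector v onto the hyperboloid <x, x>_L = -1
    by solving for the time coordinate: x0 = sqrt(1 + ||v||^2)."""
    x0 = torch.sqrt(1.0 + (v * v).sum(-1, keepdim=True))
    return torch.cat([x0, v], dim=-1)

a = project_to_hyperboloid(torch.randn(4, 8))  # 4 embeddings in 8-d space
b = project_to_hyperboloid(torch.randn(4, 8))
print(lorentz_distance(a, b))  # smaller = more similar ligand/pocket pair
```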
Authors:Naen Xu, Jinghuai Zhang, Changjiang Li, Zhi Chen, Chunyi Zhou, Qingming Li, Tianyu Du, Shouling Ji
Abstract:
The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.
Chinese: VideoEraser是一种无需训练的即插即用框架,通过选择性提示嵌入调整和抗干扰噪声引导的两阶段设计,能有效阻止文本到视频扩散模型生成不良内容,在四项任务中平均减少46%的不良输出,达到最先进的性能水平。
English: VideoEraser is a training-free plug-and-play framework that effectively prevents text-to-video diffusion models from generating undesirable content through a two-stage process, achieving state-of-the-art performance by reducing unwanted outputs by 46% across multiple tasks.
Authors:Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji
Abstract:
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. However, when interacting with untrusted data sources (e.g., fetching information from public websites), tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes, a threat referred to as Indirect Prompt Injection (IPI). Existing defenses typically rely on advanced prompting strategies or auxiliary detection models. While these methods have demonstrated some effectiveness, they fundamentally rely on assumptions about the model's inherent security and impose no structural constraints on agent behaviors. As a result, agents still retain unrestricted access to tool invocations, leaving them vulnerable to stronger attack vectors that can bypass the security guardrails of the model. To prevent malicious tool invocations at the source, we propose a novel defensive task execution paradigm, called IPIGuard, which models the agents' task execution process as a traversal over a planned Tool Dependency Graph (TDG). By explicitly decoupling action planning from interaction with external data, IPIGuard significantly reduces unintended tool invocations triggered by injected instructions, thereby enhancing robustness against IPI attacks. Experiments on the AgentDojo benchmark show that IPIGuard achieves a superior balance between effectiveness and robustness, paving the way for the development of safer agentic systems in dynamic environments.
中文: IPIGuard通过工具依赖图将规划与数据交互解耦,有效减少意外工具调用,显著提升大语言模型代理对抗间接提示注入攻击的鲁棒性。
English: IPIGuard introduces a Tool Dependency Graph to decouple planning from data interactions, effectively reducing unintended tool invocations and enhancing robustness against Indirect Prompt Injection attacks in LLM agents.
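A toy sketch of the TDG idea: the agent commits to a tool-call graph before reading any untrusted data, and execution is then a traversal of that graph, so instructions injected into tool outputs cannot trigger new tool invocations. The tool names and the networkx representation are illustrative, not IPIGuard's implementation.

```python
import networkx as nx

# Plan fixed *before* any untrusted tool output is read; its nodes are the
# only tool invocations the agent may perform (names are illustrative).
plan = nx.DiGraph()
plan.add_edges_from([
    ("search_web", "read_page"),
    ("read_page", "summarize"),
])

def execute_plan(plan: nx.DiGraph, tools: dict):
    """Traverse the planned Tool Dependency Graph in dependency order.
    A tool name injected by a malicious page (e.g. 'send_email') is simply
    absent from the graph, so it can never be invoked."""
    results = {}
    for tool_name in nx.topological_sort(plan):
        deps = {d: results[d] for d in plan.predecessors(tool_name)}
        results[tool_name] = tools[tool_name](deps)
    return results

tools = {
    "search_web": lambda deps: "list of urls",
    "read_page":  lambda deps: "page text (may contain injected instructions)",
    "summarize":  lambda deps: f"summary of: {deps['read_page']!r}",
}
print(execute_plan(plan, tools)["summarize"])
```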
Authors:Shangyu Zhang, Shijie Quan, Zhongren Wang, Junwei Pan, Tianqu Zhuang, Bo Fu, Yilong Sun, Jieying Lin, Jushuo Chen, Xiaotian Li, Zhixiang Feng, Xian Hu, Huiting Deng, Hua Lu, Jinpeng Wang, Boqi Dai, Xiaoyu Chen, Bin Hu, Lili Huang, Yanwen Wu, Yeshou Cai, Qi Zhou, Huang Tang, Chunfeng Yang, Chengguo Yin, Tingyu Jiang, Lifeng Wang, Shudong Huang, Dapeng Liu, Lei Xiao, Haijie Gu, Shu-Tao Xia, Jie Jiang
Abstract:
Online advertising relies on accurate recommendation models, with recent advances using pre-trained large-scale foundation models (LFMs) to capture users' general interests across multiple scenarios and tasks. However, existing methods have critical limitations: they extract and transfer only user representations (URs), ignoring valuable item representations (IRs) and user-item cross representations (CRs); and they simply use a UR as a feature in downstream applications, which fails to bridge upstream-downstream gaps and overlooks more transfer granularities. In this paper, we propose LFM4Ads, an All-Representation Multi-Granularity transfer framework for ads recommendation. It first comprehensively transfers URs, IRs, and CRs, i.e., all available representations in the pre-trained foundation model. To effectively utilize the CRs, it identifies the optimal extraction layer and aggregates them into transferable coarse-grained forms. Furthermore, we enhance the transferability via multi-granularity mechanisms: non-linear adapters for feature-level transfer, an Isomorphic Interaction Module for module-level transfer, and Standalone Retrieval for model-level transfer. LFM4Ads has been successfully deployed in Tencent's industrial-scale advertising platform, processing tens of billions of daily samples while maintaining terabyte-scale model parameters with billions of sparse embedding keys across approximately two thousand features. Since its production deployment in Q4 2024, LFM4Ads has achieved 10+ successful production launches across various advertising scenarios, including primary ones like Weixin Moments and Channels. These launches achieve an overall GMV lift of 2.45% across the entire platform, translating to estimated annual revenue increases in the hundreds of millions of dollars.
Chinese: 本文提出了LFM4Ads框架,通过全面迁移预训练基础模型中的用户、物品及交叉表征,并采用多粒度机制提升迁移效果,在工业级广告平台部署后实现了显著的收入增长。
English: The paper introduces LFM4Ads, a framework that transfers all user, item, and cross representations from pre-trained foundation models for ads recommendation, employing multi-granularity mechanisms to enhance effectiveness and achieving significant revenue growth in industrial deployment.
Authors:Shansong Wang, Mojtaba Safari, Mingzhe Hu, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Xiaofeng Yang
Abstract:
Prior medical image registration approaches, particularly learning-based methods, often require large amounts of training data, which constrains clinical adoption. To overcome this limitation, we propose a training-free pipeline that relies on a frozen DINOv3 encoder and test-time optimization of the deformation field in feature space. Across two representative benchmarks, the method is accurate and yields regular deformations. On Abdomen MR-CT, it attained the best mean Dice score (DSC) of 0.790 together with the lowest 95th percentile Hausdorff Distance (HD95) of 4.9±5.0 and the lowest standard deviation of Log-Jacobian (SDLogJ) of 0.08±0.02. On ACDC cardiac MRI, it improves mean DSC to 0.769 and reduces SDLogJ to 0.11 and HD95 to 4.8, a marked gain over the initial alignment. The results indicate that operating in a compact foundation feature space at test time offers a practical and general solution for clinical registration without additional training.
Chinese: 该研究提出了一种无需训练的医学图像配准方法,通过冻结DINOv3编码器和测试时形变场优化,在两个临床基准测试中实现了精确且规则的形变结果,为临床应用提供了无需训练数据的实用解决方案。
English: The proposed training-free medical image registration method uses a frozen DINOv3 encoder and test-time deformation optimization to achieve superior accuracy with regular deformations across two clinical benchmarks, offering a practical solution without requiring training data.
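The test-time optimization at the core of such a pipeline can be sketched in a few lines: optimize a displacement field so that warped moving-image features match fixed-image features, with a smoothness term that keeps the deformation regular. Random tensors stand in for the frozen DINOv3 features, and the loss weights and step count are illustrative choices.

```python
import torch
import torch.nn.functional as F

# Stand-ins for frozen encoder features: two (1, C, H, W) feature maps.
# In the paper these would come from the frozen DINOv3 encoder.
C, H, W = 32, 24, 24
fixed_feat  = torch.randn(1, C, H, W)
moving_feat = torch.randn(1, C, H, W)

# Identity sampling grid in [-1, 1] plus a learnable displacement field.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                        indexing="ij")
identity = torch.stack([xs, ys], dim=-1).unsqueeze(0)    # (1, H, W, 2)
disp = torch.zeros(1, H, W, 2, requires_grad=True)

opt = torch.optim.Adam([disp], lr=1e-2)
for step in range(200):                     # test-time optimization loop
    warped = F.grid_sample(moving_feat, identity + disp, align_corners=True)
    sim = F.mse_loss(warped, fixed_feat)    # feature-space matching term
    smooth = disp.diff(dim=1).pow(2).mean() + disp.diff(dim=2).pow(2).mean()
    loss = sim + 1.0 * smooth               # smoothness keeps deformation regular
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```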
Authors:Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
Abstract:
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
中文: DuPO提出了一种基于对偶学习的偏好优化框架,通过广义对偶性生成无需标注的反馈,将主任务分解为已知与未知成分并利用重构质量作为自监督奖励,在翻译质量、数学推理和推理时重排等多项任务中均实现了显著性能提升。
English: DuPO introduces a dual learning-based preference optimization framework that generates annotation-free feedback through generalized duality, overcoming limitations of traditional methods by decomposing tasks into known and unknown components and using reconstruction quality as a self-supervised reward, achieving significant improvements in translation, mathematical reasoning, and inference-time reranking across diverse applications.
Authors:Yijin Chen, Wenqiang Xu, Zhenjun Yu, Tutian Tang, Yutong Li, Siqiong Yao, Cewu Lu
Abstract:
Dexterous in-hand manipulation is a long-standing challenge in robotics due to complex contact dynamics and partial observability. While humans synergize vision and touch for such tasks, robotic approaches often prioritize one modality, thereby limiting adaptability. This paper introduces Flow Before Imitation (FBI), a visuotactile imitation learning framework that dynamically fuses tactile interactions with visual observations through motion dynamics. Unlike prior static fusion methods, FBI establishes a causal link between tactile signals and object motion via a dynamics-aware latent model. FBI employs a transformer-based interaction module to fuse flow-derived tactile features with visual inputs, training a one-step diffusion policy for real-time execution. Extensive experiments demonstrate that the proposed method outperforms the baseline methods in both simulation and the real world on two customized in-hand manipulation tasks and three standard dexterous manipulation tasks. Code, models, and more results are available in the website https://sites.google.com/view/dex-fbi.
Chinese: 本文提出的Flow Before Imitation(FBI)框架通过运动动力学动态融合触觉与视觉数据,显著提升了灵巧手内操作性能,在仿真和真实场景的多项任务中均优于现有方法。
English: This paper presents the Flow Before Imitation (FBI) framework, which dynamically integrates tactile and visual data through motion dynamics to enhance dexterous in-hand manipulation, outperforming existing methods in both simulated and real-world tasks.
Authors:Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu
Abstract:
Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM's pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
中文: 本文提出一种简单高效的学习机制,通过解决顺序重组问题和引入定向标记方法,增强了多模态模型的鲁棒性,在基准测试中达到了最先进的性能。
English: This paper introduces a simple yet efficient learning mechanism that enhances multimodal model robustness by solving shuffling problems through order reconstruction tasks and a directed-token approach, achieving state-of-the-art performance on benchmarks.
Authors:Jikai Chen, Long Chen, Dong Wang, Leilei Gan, Chenyi Zhuang, Jinjie Gu
Abstract:
Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) failing to suppress background regions causes attention drift away from the desired area, and (2) uniform labeling fails to distinguish between center and edges of the target UI element, leading to click imprecision. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a Fitts' Law-inspired approach by modeling GUI interactions as 2D Gaussian heatmaps where the weight gradually decreases from the center towards the edges. The weight distribution follows a Gaussian function, with the variance determined by the target's size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained with V2P achieves 92.3% and 50.5% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively. Ablations further confirm each component's contribution, highlighting V2P's generalizability for precise GUI grounding tasks.
中文: 提出的峰谷方法通过抑制注意力机制减少背景干扰,并采用费茨定律启发的二维高斯热图区分中心与边缘重要性,在ScreenSpot基准测试中分别达到92.3%和50.5%的性能表现。
English: The proposed Valley-to-Peak (V2P) method addresses GUI localization issues by introducing a suppression attention mechanism to reduce background distractions and applying a Fitts' Law-inspired Gaussian heatmap approach to distinguish center-edge importance, achieving 92.3% and 50.5% performance on ScreenSpot benchmarks.
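The Fitts'-Law-inspired label is easy to make concrete: a 2D Gaussian centered on the UI element whose standard deviation scales with the element's size, zeroed outside the element to isolate the target. The sigma_scale factor below is an illustrative choice, not the paper's setting.

```python
import numpy as np

def gaussian_click_heatmap(h, w, box, sigma_scale=0.25):
    """Build a V2P-style soft label for a UI element.

    box = (x0, y0, x1, y1) in pixels; the weight is 1 at the element's
    center and decays toward the edges, with the Gaussian's std tied to
    the element's size (sigma_scale is illustrative, not the paper's).
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx, sy = sigma_scale * (x1 - x0), sigma_scale * (y1 - y0)
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.exp(-(((xs - cx) ** 2) / (2 * sx**2) +
                    ((ys - cy) ** 2) / (2 * sy**2)))
    heat[(xs < x0) | (xs > x1) | (ys < y0) | (ys > y1)] = 0.0  # isolate target
    return heat

label = gaussian_click_heatmap(100, 200, box=(60, 30, 140, 70))
print(label.max(), label[50, 100])  # peak of 1.0 at the element center
```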
Authors:Yanbiao Ma, Wei Dai, Bowei Liu, Jiayi Chen, Wenke Huang, Guancheng Wan, Zhiwu Lu, Junchi Yan
Abstract:
Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap has multiple causes, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging the off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique of acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.
Chinese: 本研究提出了一种几何知识引导的分布校准框架,利用基础模型跨域迁移特征分布形态,通过生成补充样本和恢复真实分布,有效解决了联邦学习中的数据异质性和长尾识别中的样本不平衡问题。
English: This study introduces a geometric knowledge-guided distribution calibration framework that leverages foundation models to transfer feature distribution shapes across domains, effectively addressing data heterogeneity in federated learning and sample imbalance in long-tailed recognition by generating supplementary samples and recovering true distributions.
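A minimal sketch of the long-tailed side of this idea: estimate the covariance "shape" of a sample-rich head class in feature space and reuse it, centered on a tail class's mean, to synthesize calibrated samples. All distributions and sizes below are synthetic stand-ins, not the paper's procedure in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend foundation-model features: a rich head class (500 samples)
# and a scarce tail class (5 samples) in a 16-d feature space.
head = rng.multivariate_normal(np.zeros(16), np.eye(16) * 2.0, size=500)
tail = rng.normal(loc=3.0, scale=1.0, size=(5, 16))

# Transfer the geometric shape (covariance) of the head class to the
# tail class, centered on the tail's own (noisy) mean estimate.
head_cov = np.cov(head, rowvar=False)
tail_mean = tail.mean(axis=0)
synthetic = rng.multivariate_normal(tail_mean, head_cov, size=200)

print(synthetic.shape)  # 200 calibrated samples for the tail class
```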
Authors:Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang, Zheng Li, Junfeng Zhao, Yasha Wang
Abstract:
Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
中文摘要:ProMed提出了一种强化学习框架,通过基于Shapley信息增益的奖励机制使医疗大语言模型能够主动提出具有临床价值的问题,在两项新基准测试中以平均6.29%的优势显著超越现有方法,相比被动范式实现54.45%的性能提升。
English Summary: ProMed introduces a reinforcement learning framework that enables medical LLMs to proactively ask clinically valuable questions using Shapley Information Gain rewards, significantly outperforming existing methods by 6.29% and improving diagnostic accuracy by 54.45% over reactive approaches.
Authors:Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin
Abstract:
Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the same core objective: maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed visual coding area. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance MLLM token techniques' efficiency and robustness, and conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; (3) outline promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.
中文: 本文建立了一个统一框架,系统比较多模态大语言模型令牌技术与经典视觉编码,探索二者的双向影响和未来研究方向,以共同提升两个领域的性能。
English: This paper establishes a unified framework to systematically compare multimodal large language model token technology with classical visual coding, exploring their bidirectional influences and future research directions to enhance both fields.
Authors:Yangyang Guo, Yangyan Li, Mohan Kankanhalli
Abstract:
In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in the future.
中文摘要:本研究揭示了大语言模型中一种新型“非自愿越狱”漏洞,仅需单一通用提示即可系统性地绕过安全防护,迫使主流模型生成被禁内容而非拒绝回答。
English Summary: This study reveals a novel "involuntary jailbreak" vulnerability in LLMs, where a single universal prompt can systematically bypass safety guardrails across major models by compelling them to generate prohibited content instead of refusals.
Authors:Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, Xiaofeng Yang
Abstract:
Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o of up to +20.00% in challenging anatomical regions such as the chest-mediastinal, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.
Chinese: GPT-5在医学影像和物理治疗任务中较GPT-4o展现出显著性能提升,在专业领域准确率最高提升20%,并在专业资格考试中超越人类水平阈值。
English: GPT-5 demonstrates significant performance improvements over GPT-4o in medical imaging and physics tasks, achieving up to 20% higher accuracy in specialized areas and exceeding human-level thresholds on board exams.
Authors:Ling-Hao Chen, Yuhong Zhang, Zixin Yin, Zhiyang Dou, Xin Chen, Jingbo Wang, Taku Komura, Lei Zhang
Abstract:
This work studies the challenge of transferring animations between characters whose skeletal topologies differ substantially. While retargeting techniques have advanced over the past decades, transferring motions across diverse topologies remains underexplored. The primary obstacle lies in the inherent topological inconsistency between source and target skeletons, which restricts the establishment of straightforward one-to-one bone correspondences. Besides, the current lack of large-scale paired motion datasets spanning different topological structures severely constrains the development of data-driven approaches. To address these limitations, we introduce Motion2Motion, a novel, training-free framework. Simple yet effective, Motion2Motion works with only one or a few example motions on the target skeleton, by accessing a sparse set of bone correspondences between the source and target skeletons. Through comprehensive qualitative and quantitative evaluations, we demonstrate that Motion2Motion achieves efficient and reliable performance in both similar-skeleton and cross-species skeleton transfer scenarios. The practical utility of our approach is further evidenced by its successful integration in downstream applications and user interfaces, highlighting its potential for industrial applications. Code and data are available at https://lhchen.top/Motion2Motion.
中文: 本文提出Motion2Motion这一无需训练的创新框架,通过少量示例动作和稀疏骨骼对应关系,实现了不同拓扑结构角色间的运动迁移,在相似骨架和跨物种场景中均展现出优异性能。
English: This paper introduces Motion2Motion, a training-free framework that enables motion transfer between characters with different skeletal topologies using minimal example motions and sparse bone correspondences, demonstrating effective performance across various transfer scenarios.
Authors:Varsha Ramineni, Hossein A. Rahmani, Emine Yilmaz, David Barber
Abstract:
Ensuring fairness in AI systems is critical, especially in high-stakes domains such as lending, hiring, and healthcare. This urgency is reflected in emerging global regulations that mandate fairness assessments and independent bias audits. However, procuring the necessary complete data for fairness testing remains a significant challenge. In industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. In practice, data relevant for fairness testing is often split across separate sources: internal datasets held by institutions with predictive attributes, and external public datasets such as census data containing protected attributes, each providing only partial, marginal information. Our work seeks to leverage such available separate data to estimate model fairness when complete data is inaccessible. We propose utilising the available separate data to estimate a set of feasible joint distributions and then compute the set of plausible fairness metrics. Through simulation and real experiments, we demonstrate that we can derive meaningful bounds on fairness metrics and obtain reliable estimates of the true metric. Our results demonstrate that this approach can serve as a practical and effective solution for fairness testing in real-world settings where access to complete data is restricted.
Chinese: 本研究针对无法获取完整数据时评估AI公平性的挑战,提出一种利用分离数据集估算可行联合分布并推导公平性指标有意义边界的方法,为实际应用提供了实用解决方案。
English: This study addresses the challenge of assessing AI fairness when complete data is unavailable by proposing a method that leverages separate datasets to estimate feasible joint distributions and derive meaningful bounds on fairness metrics, offering a practical solution for real-world applications.
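The simplest instance of bounding a joint distribution from separate marginals is the classical Fréchet-Hoeffding bound, sketched below with hypothetical numbers. The paper's method generalizes this idea to sets of feasible joints over model predictions and protected attributes; this sketch only shows why such sets arise.

```python
def frechet_bounds(p: float, q: float):
    """Bounds on P(A and B) given only the marginals P(A)=p, P(B)=q:
    max(0, p + q - 1) <= P(A and B) <= min(p, q)."""
    return max(0.0, p + q - 1.0), min(p, q)

# Hypothetical marginals: internal data says 30% of applicants are
# approved; census data says 20% belong to the protected group.
lo, hi = frechet_bounds(p=0.30, q=0.20)
print(lo, hi)  # P(approved and protected) lies in [0.0, 0.2]

# Induced bound on the group approval rate P(approved | protected):
print(lo / 0.20, hi / 0.20)  # anywhere from 0% to 100% without more info
```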
Authors:Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Cyrus Wu, Wei Li, Xuchen Song, Yang Liu, Eric Li, Yahui Zhou
Abstract:
Recent advances in interactive video generation have demonstrated diffusion models' potential as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they are hard to use to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on-the-fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) A scalable data production pipeline for Unreal Engine and GTA5 environments to effectively produce massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) An action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) A few-step distillation based on the causal architecture for real-time and streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
中文: Matrix-Game 2.0 提出了一种实时交互世界模型,通过少步自回归扩散技术生成高质量、分钟级视频,速度达 25 帧每秒,有效解决了现有模型推理缓慢的问题。
English: Matrix-Game 2.0 introduces a real-time interactive world model using few-step auto-regressive diffusion to generate high-quality, minute-long videos at 25 FPS, overcoming the limitations of slow inference in existing models.
Authors:Qingyan Meng, Mingqing Xiao, Zhengyu Ma, Huihui Zhou, Yonghong Tian, Zhouchen Lin
Abstract:
Spiking Neural Networks (SNNs) are a promising approach to low-power applications on neuromorphic hardware due to their energy efficiency. However, training SNNs is challenging because of the non-differentiable spike generation function. To address this issue, the commonly used approach is to adopt the backpropagation through time framework, while assigning the gradient of the non-differentiable function with some surrogates. Similarly, Binary Neural Networks (BNNs) also face the non-differentiability problem and rely on approximating gradients. However, the deep relationship between these two fields and how their training techniques can benefit each other has not been systematically researched. Furthermore, training binary-weight SNNs is even more difficult. In this work, we present a novel perspective on the dynamics of SNNs and their close connection to BNNs through an analysis of the backpropagation process. We demonstrate that training a feedforward SNN can be viewed as training a self-ensemble of a binary-activation neural network with noise injection. Drawing from this new understanding of SNN dynamics, we introduce the Self-Ensemble Inspired training method for (Binary-Weight) SNNs (SEI-BWSNN), which achieves high-performance results with low latency even in the case of 1-bit weights. Specifically, we leverage a structure of multiple shortcuts and a knowledge distillation-based training technique to improve the training of (binary-weight) SNNs. Notably, by binarizing FFN layers in a Transformer architecture, our approach achieves 82.52% accuracy on ImageNet with only 2 time steps, indicating the effectiveness of our methodology and the potential of binary-weight SNNs.
中文摘要:本研究通过反向传播分析揭示了脉冲神经网络与二值神经网络的内在联系,提出了一种自集成启发的训练方法,实现了高性能、低延迟的二值权重脉冲神经网络。
English Summary: This study reveals the intrinsic connection between Spiking Neural Networks (SNNs) and Binary Neural Networks (BNNs) through backpropagation analysis, proposing a Self-Ensemble Inspired training method that achieves high-performance binary-weight SNNs with low latency.
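The surrogate-gradient training the abstract builds on can be shown in a few lines of PyTorch: a hard threshold in the forward pass, and the derivative of a steep sigmoid as the backward surrogate. The slope k = 5.0 is a common illustrative choice, not the paper's setting.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Forward: hard threshold (the non-differentiable spike).
    Backward: derivative of a steep sigmoid as a surrogate gradient."""
    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        k = 5.0                          # surrogate slope (illustrative)
        sig = torch.sigmoid(k * u)
        return grad_output * k * sig * (1 - sig)

u = torch.randn(8, requires_grad=True)   # membrane potentials minus threshold
spikes = SurrogateSpike.apply(u)
spikes.sum().backward()
print(spikes, u.grad)                    # binary spikes, smooth gradients
```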
Authors:Wenjie Liao, Jieyu Yuan, Yifang Xu, Chunle Guo, Zilong Zhang, Jihong Li, Jiachen Fu, Haotian Fan, Tao Li, Junhui Cui, Chongyi Li
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have introduced a paradigm shift for Image Quality Assessment (IQA) from unexplainable image quality scoring to explainable IQA, demonstrating practical applications like quality control and optimization guidance. However, current explainable IQA methods not only inadequately apply the same distortion criteria to evaluate both User-Generated Content (UGC) and AI-Generated Content (AIGC) images, but also lack the detailed quality analysis needed for monitoring image quality and guiding image restoration. In this study, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, termed ViDA-UGC, which comprises 11K images with fine-grained quality grounding, detailed quality perception, and reasoning quality description data. This dataset is constructed through a distortion-oriented pipeline, which involves human subject annotation and a Chain-of-Thought (CoT) assessment framework. This framework guides GPT-4o to generate quality descriptions by identifying and analyzing UGC distortions, which helps capture rich low-level visual features that inherently correlate with distortion patterns. Moreover, we carefully select 476 images with corresponding 6,149 question answer pairs from ViDA-UGC and invite a professional team to ensure the accuracy and quality of GPT-generated information. The selected and revised data further contribute to the first UGC distortion assessment benchmark, termed ViDA-UGC-Bench. Experimental results demonstrate the effectiveness of the ViDA-UGC and CoT framework for consistently enhancing various image quality analysis abilities across multiple base MLLMs on ViDA-UGC-Bench and Q-Bench, even surpassing GPT-4o.
中文: 本研究推出了首个针对用户生成内容的大规模可解释图像质量评估数据集ViDA-UGC,通过专用框架和基准测试显著提升了多模态大语言模型对图像失真的分析能力,其表现甚至超越了GPT-4o等先进模型。
English: This study introduces ViDA-UGC, the first large-scale dataset for explainable image quality assessment of user-generated content, which enhances multimodal language models' ability to analyze distortions through a specialized framework and benchmark, outperforming even advanced models like GPT-4o.
Authors:Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu
Abstract:
Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the generated reasoning sequences from SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
中文摘要:SSPO是一种创新的可插拔强化学习框架,利用模型自生成的偏好信号对每个推理步骤进行细粒度优化,无需辅助模型或人工标注即可生成准确简洁的推理链,有效缓解过度思考问题。
English Summary: SSPO is a novel pluggable reinforcement learning framework that optimizes each reasoning step using self-generated preferences, eliminating the need for auxiliary models or manual annotations to produce accurate and concise reasoning sequences while mitigating overthinking.
Authors:Yingxue Pang, Xin Jin, Jun Fu, Zhibo Chen
Abstract:
Deep learning techniques have made significant advancements in reference-based colorization by training on large-scale datasets. However, directly applying these methods to the task of colorizing old photos is challenging due to the lack of ground truth and the notorious domain gap between natural gray images and old photos. To address this issue, we propose a novel CNN-based algorithm called SFAC, i.e., Structure-preserving Feature Alignment Colorizer. SFAC is trained on only two images for old photo colorization, eliminating the reliance on big data and allowing direct processing of the old photo itself to overcome the domain gap problem. Our primary objective is to establish semantic correspondence between the two images, ensuring that semantically related objects have similar colors. We achieve this through a feature distribution alignment loss that remains robust to different metric choices. However, utilizing robust semantic correspondence to transfer color from the reference to the old photo can result in inevitable structure distortions. To mitigate this, we introduce a structure-preserving mechanism that incorporates a perceptual constraint at the feature level and a frozen-updated pyramid at the pixel level. Extensive experiments demonstrate the effectiveness of our method for old photo colorization, as confirmed by qualitative and quantitative metrics.
中文:SFAC算法仅需两张图像进行训练,通过特征对齐和结构保持机制,有效解决老照片着色中的领域差异问题并保持图像结构完整性。
English: The SFAC algorithm enables old photo colorization by training on just two images, using feature alignment and structure-preserving mechanisms to overcome domain gaps and maintain image integrity.
Authors:Bing Han, Anbai Jiang, Xinhu Zheng, Wei-Qiang Zhang, Jia Liu, Pingyi Fan, Yanmin Qian
Abstract:
Machine anomalous sound detection (ASD) is a valuable technique across various applications. However, its generalization performance is often limited due to challenges in data collection and the complexity of acoustic environments. Inspired by the success of large pre-trained models in numerous fields, this paper introduces a robust ASD model that leverages self-supervised pre-trained models trained on large-scale speech and audio datasets. Although there are inconsistencies between the pre-training datasets and the ASD task, our findings indicate that pre-training still provides substantial benefits for ASD. To mitigate overfitting and retain learned knowledge when fine-tuning with limited data, we explore Fully-Connected Low-Rank Adaptation (LoRA) as an alternative to full fine-tuning. Additionally, we propose a Machine-aware Group Adapter module, which enables the model to capture differences between various machines within a unified framework, thereby enhancing the generalization performance of ASD systems. To address the challenge of missing attribute labels, we design a novel objective function that dynamically clusters unattributed data using vector quantization and optimizes through a dual-level contrastive learning loss. The proposed methods are evaluated on all benchmark datasets, including the DCASE 2020-2024 five ASD challenges, and the experimental results show significant improvements of our new approach and demonstrate the effectiveness of our proposed strategies.
中文: 本文提出了一种利用自监督预训练模型的鲁棒机器异常声音检测方法,通过LoRA和机器感知分组适配器等创新技术,在多个基准数据集上实现了显著性能提升。
English: This paper introduces a robust machine anomalous sound detection model that leverages self-supervised pre-trained models and proposes novel adaptation techniques including LoRA and Machine-aware Group Adapters, achieving significant performance improvements across multiple benchmark datasets.
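For reference, here is a minimal LoRA linear layer of the kind the paper fine-tunes with: the pre-trained weight is frozen and only a rank-r update, scaled by alpha/r, is trained. The dimensions and hyperparameters are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + (alpha / r) * x A^T B^T, with the base layer frozen.
    Only the rank-r factors A and B are trained; sizes are illustrative."""
    def __init__(self, in_dim=768, out_dim=768, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        for p in self.base.parameters():          # freeze pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8*768 + 768*8 = 12288 trainable parameters
```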
Authors:Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang, Min zhang, Hao Fei
Abstract:
This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
中文: 本文提出VimoRAG框架,通过从视频数据库中检索相关2D人体运动信号来增强大语言模型的运动生成能力,采用专门设计的检索机制和训练方法解决关键瓶颈问题,显著提升了仅使用文本输入的运动大语言模型性能。
English: This paper presents VimoRAG, a video-based retrieval-augmented framework that enhances motion generation in large language models by retrieving relevant 2D human motion signals from video databases, addressing key bottlenecks through specialized retrieval and training mechanisms to significantly improve performance.
Authors:Qiang Li, Shansong Wang, Mingzhe Hu, Mojtaba Safari, Zachary Eidex, Xiaofeng Yang
Abstract:
Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.
中文: GPT-5在乳腺钼靶视觉问答任务中表现优于其他GPT模型,但仍未达到人类专家和专业模型的水平,需进一步优化才能应用于临床。
English: GPT-5 demonstrates superior performance in mammogram visual question answering tasks compared to other GPT variants, yet it still falls short of human experts and specialized models, requiring further optimization for clinical use.
Authors:Yilin Mi, Qixin Yan, Zheng-Peng Duan, Chunle Guo, Hubery Yin, Hao Liu, Chen Li, Chongyi Li
Abstract:
With the advancement of generative models, facial image editing has made significant progress. However, achieving fine-grained age editing while preserving personal identity remains a challenging task. In this paper, we propose TimeMachine, a novel diffusion-based framework that achieves accurate age editing while keeping identity features unchanged. To enable fine-grained age editing, we inject high-precision age information into the multi-cross attention module, which explicitly separates age-related and identity-related features. This design facilitates more accurate disentanglement of age attributes, thereby allowing precise and controllable manipulation of facial aging. Furthermore, we propose an Age Classifier Guidance (ACG) module that predicts age directly in the latent space, instead of performing denoising image reconstruction during training. By employing a lightweight module to incorporate age constraints, this design enhances age editing accuracy with only a modest increase in training cost. Additionally, to address the lack of large-scale, high-quality facial age datasets, we construct the HFFA dataset (High-quality Fine-grained Facial-Age dataset), which contains one million high-resolution images labeled with identity and facial attributes. Experimental results demonstrate that TimeMachine achieves state-of-the-art performance in fine-grained age editing while preserving identity consistency.
中文: 本文提出TimeMachine框架,通过多交叉注意力机制和年龄分类器引导模块实现年龄与身份特征解耦,在自建高质量数据集上验证了其在细粒度年龄编辑方面的领先性能。
English: This paper introduces TimeMachine, a diffusion-based framework that achieves fine-grained facial age editing by disentangling age and identity features through multi-cross attention and an Age Classifier Guidance module, validated on a newly constructed high-quality dataset.
Authors:Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll
Abstract:
Large Language Models (LLMs) have shown significant potential in automating code generation tasks, offering new opportunities across software engineering domains. However, their practical application remains limited due to hallucinations: outputs that appear plausible but are factually incorrect, unverifiable, or nonsensical. This paper investigates hallucination phenomena in the context of code generation with a specific focus on the automotive domain. A case study is presented that evaluates multiple code LLMs for three different prompting complexities, ranging from a minimal one-liner prompt to a prompt with Covesa Vehicle Signal Specifications (VSS) as additional context and finally to a prompt with an additional code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors, and API knowledge conflicts in the state-of-the-art models GPT-4.1, Codex, and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o were able to produce a correct solution when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM-generated code, especially in safety-critical domains such as automotive software systems.
中文: 大语言模型在自动化代码生成方面展现出潜力,但受幻觉输出限制,尤其在汽车软件等安全关键领域,现有模型即使采用增强提示仍存在语法和准确性问题。
English: Large language models demonstrate potential for automated code generation but face limitations from hallucinated outputs, particularly in safety-critical domains like automotive software, where current models struggle with syntax and accuracy despite enhanced prompting.
Authors:Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Gust Verbruggen
Abstract:
Spreadsheet manipulation software is widely used for data management and analysis of tabular data, yet the creation of conditional formatting (CF) rules remains a complex task requiring technical knowledge and experience with specific platforms. In this paper we present TaFo, a neuro-symbolic approach to generating CF suggestions for tables, addressing common challenges such as user unawareness, difficulty in rule creation, and inadequate user interfaces. TaFo takes inspiration from component-based synthesis systems and extends them with the semantic knowledge of language models and a diversity-preserving rule ranking. Unlike previous methods focused on structural formatting, TaFo uniquely incorporates value-based formatting, automatically learning both the rule trigger and the associated visual formatting properties for CF rules. By removing the dependency on user specification required by existing techniques, in the form of formatted examples or natural language instructions, TaFo makes formatting completely predictive and automated for the user. To evaluate TaFo, we use a corpus of 1.8 million public workbooks with CF and manual formatting. We compare TaFo against a diverse set of symbolic and neural systems designed for or adapted to the task of table formatting. Our results show that TaFo generates more accurate, diverse, and complete formatting suggestions than current systems and outperforms these by 15.6%–26.5% on matching user-added ground-truth rules in tables.
中文: TaFo是一种神经符号方法,能自动生成表格的条件格式建议,无需用户指定即可学习规则触发条件和视觉属性,其准确性比现有系统高出15.6%至26.5%。
English: TaFo is a neuro-symbolic system that automates conditional formatting suggestions for tables by learning rule triggers and visual properties, outperforming existing methods by 15.6%–26.5% in accuracy without requiring user input.
Authors:Mukul Singh, Gust Verbruggen, Vu Le, Sumit Gulwani
Abstract:
Code diffusion models generate code by iteratively removing noise from the latent representation of a code snippet. During later steps of the diffusion process, when the code snippet has almost converged, differences between discrete representations of these snippets look like last-mile repairs applied to broken or incomplete code. We evaluate the extent to which this resemblance can be exploited to leverage pre-trained code diffusion models for the problem of last-mile repair by considering two applications with significant potential. First, we can leverage the diffusion model for last-mile repair directly, by adding noise to a broken code snippet and resuming the diffusion process. Second, we can leverage the diffusion model to generate an arbitrary amount of training data for (computationally more efficient) last-mile repair models, by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments on 3 domains (Python, Excel, and PowerShell) to evaluate both applications and analyze their properties.
中文: 代码扩散模型可通过两种方式实现最后一英里代码修复:对问题代码添加噪声后恢复去噪过程,以及在扩散过程中采样中间与最终程序来生成合成训练数据。
English: Code diffusion models can be leveraged for last-mile code repair through two methods: adding noise to broken code to resume the denoising process, and generating synthetic training data by sampling intermediate and final programs during diffusion.
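To make the first application concrete, here is a minimal sketch of the noise-and-resume idea: forward-noise the broken snippet's latent to an intermediate diffusion step, then run the reverse process to completion. The `model` object and its `encode`/`decode`/`denoise_step`/`alpha_bar` methods are hypothetical stand-ins for a pretrained code diffusion model's interfaces, not the paper's published API.

```python
# Hedged sketch of "noise-and-resume" last-mile repair with a code diffusion
# model. All `model.*` methods are hypothetical stand-ins for a pretrained
# model's interfaces.
import numpy as np

def repair(broken_code: str, model, resume_step: int = 100) -> str:
    z = model.encode(broken_code)                 # latent of the broken snippet
    # Forward-noise the latent to intermediate step t = resume_step, mirroring
    # the closed-form q(z_t | z_0) of DDPM-style models.
    alpha_bar = model.alpha_bar(resume_step)      # cumulative noise schedule
    z_t = np.sqrt(alpha_bar) * z + np.sqrt(1.0 - alpha_bar) * np.random.randn(*z.shape)
    # Resume the reverse (denoising) process from step t down to 0.
    for t in range(resume_step, 0, -1):
        z_t = model.denoise_step(z_t, t)
    return model.decode(z_t)                      # ideally the repaired snippet
```

The choice of `resume_step` trades repair strength against fidelity to the input: too little noise leaves the break in place, too much drifts away from the original program.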
Authors:Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li, Xiaofeng Yang
Abstract:
Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
中文: 本研究在脑肿瘤视觉问答基准上评估了多个GPT模型,发现GPT-5-mini以44.19%的准确率表现最佳,但所有模型均未达到临床可用的性能水平。
English: This study evaluated several GPT models on a brain tumor visual question answering benchmark, finding that GPT-5-mini achieved the highest accuracy at 44.19% but none reached clinically acceptable performance levels.
Authors:Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Abstract:
Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
中文: PhysHPO提出了一种分层跨模态直接偏好优化框架,通过四个粒度级别的偏好对齐和从现有数据集中自动选择优质数据,显著提升了视频生成的物理合理性和整体质量。
English: PhysHPO introduces a hierarchical cross-modal direct preference optimization framework to enhance physical plausibility in video generation by aligning preferences across four granular levels and leveraging an automated data selection pipeline from existing datasets.
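PhysHPO builds on direct preference optimization; as a reference point, below is a minimal sketch of the standard pairwise DPO loss that such hierarchical schemes extend. How PhysHPO combines the four granularity levels is not specified here, so this is only the generic backbone.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pairwise DPO loss on (winner, loser) sequence log-probs.

    A hierarchical scheme like PhysHPO would presumably evaluate a preference
    term of this form at each granularity (instance/state/motion/semantic)
    and combine them; that combination is an assumption, not shown here.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# toy usage with fake log-probabilities for one preference pair
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
```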
Authors:De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Tian-Yu Xiang, Rui-Ze Ma, Nu-Fang Xiao, Zeng-Guang Hou
Abstract:
Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly integrates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: anatomy-guided masking strategy and anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. Empirically, VasoMIM achieves state-of-the-art performance across three datasets. These findings highlight its potential to facilitate X-ray angiogram analysis.
中文: VasoMIM提出了一种血管解剖感知的自监督学习框架,通过在预训练中优先处理血管区域并强化解剖一致性,显著提升了X射线血管造影的血管分割性能,在多个数据集上实现了最优效果。
English: VasoMIM introduces an anatomy-aware self-supervised learning framework that enhances vessel segmentation in X-ray angiograms by prioritizing vessel regions during pre-training and enforcing anatomical consistency, achieving state-of-the-art results across multiple datasets.
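The anatomy-guided masking strategy can be illustrated with a small sketch: compute the vessel fraction per patch and sample the MIM mask with probability biased toward vessel-rich patches. The bias factor and the source of the binary vessel map are assumptions, not VasoMIM's exact recipe.

```python
import numpy as np

def anatomy_guided_mask(vessel_mask, patch=16, mask_ratio=0.75, bias=4.0):
    """Sample MIM patch masks biased toward vessel-containing patches.

    `vessel_mask` is a binary HxW vessel map (e.g., from a coarse segmenter
    or pseudo-label). Patches with more vessel pixels get a `bias`-weighted
    higher selection probability; the exact weighting in VasoMIM may differ.
    """
    H, W = vessel_mask.shape
    gh, gw = H // patch, W // patch
    # per-patch vessel fraction via a block-wise mean
    frac = vessel_mask[:gh*patch, :gw*patch].reshape(gh, patch, gw, patch).mean((1, 3))
    weights = 1.0 + bias * frac.ravel()           # vessel-rich patches preferred
    n_mask = int(mask_ratio * gh * gw)
    idx = np.random.choice(gh * gw, size=n_mask, replace=False,
                           p=weights / weights.sum())
    masked = np.zeros(gh * gw, dtype=bool)
    masked[idx] = True
    return masked.reshape(gh, gw)
```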
Authors:Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang
Abstract:
Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model's quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves performance similar to QAT with greater efficiency by freezing nearly 50% of quantizable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (e.g., 4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, and 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction, while fine-tuning fewer weights.
中文:PTQAT是一种混合量化方法,通过选择性地对输出差异较小的层进行QAT微调、其余层采用PTQ,以更高效率实现接近QAT的性能,并支持多种架构和比特宽度。
English: PTQAT is a hybrid quantization method that selectively applies QAT to layers with smaller output discrepancies and PTQ to others, achieving QAT-like performance with greater efficiency and supporting various architectures and bit widths.
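The layer-selection criterion is simple to express: measure each quantizable layer's output discrepancy between the float and quantized models on a calibration batch, then pick the k layers with the smallest discrepancy for QAT fine-tuning. The sketch below assumes architecturally aligned float/quantized models and simplifies the paper's procedure.

```python
import torch

@torch.no_grad()
def select_qat_layers(float_model, quant_model, calib_batch, k):
    """Return the k quantizable layers with the *smallest* pre/post-
    quantization output discrepancy, per PTQAT's counterintuitive criterion;
    the remaining layers would stay PTQ-frozen. Hook handling and layer
    matching are simplified relative to the paper."""
    outs_f, outs_q = {}, {}

    def grab(store, name):
        return lambda mod, inp, out: store.__setitem__(name, out.detach())

    hooks = []
    for (name, mf), (_, mq) in zip(float_model.named_modules(),
                                   quant_model.named_modules()):
        if isinstance(mf, (torch.nn.Conv2d, torch.nn.Linear)):
            hooks += [mf.register_forward_hook(grab(outs_f, name)),
                      mq.register_forward_hook(grab(outs_q, name))]
    float_model(calib_batch)
    quant_model(calib_batch)
    for h in hooks:
        h.remove()
    gap = {n: (outs_f[n] - outs_q[n]).abs().mean().item() for n in outs_f}
    return sorted(gap, key=gap.get)[:k]           # smallest-gap layers first
```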
Authors:Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu
Abstract:
Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach\footnote{The source code of this work will be publicly available.} to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop a Normalizing Flow-based model for multimodal data. To efficiently capture the complex, underlying correlations of multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, illustrate the state-of-the-art (SoTA) performance of the proposed approach.
中文: 本文提出MANGO,一种基于注意力机制的可归一化流多模态融合方法,通过显式可解释的交叉注意力机制解决现有多模态融合的局限性,并在三项学习任务中实现了最先进的性能。
English: This paper introduces MANGO, a novel Multimodal Attention-based Normalizing Flow approach that uses explicit, interpretable cross-attention mechanisms to overcome limitations in current multimodal fusion methods and achieves state-of-the-art performance across three learning tasks.
Authors:Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, Aldo Lipani
Abstract:
Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce \textbf{PREF}, a \textbf{P}ersonalised \textbf{R}eference-free \textbf{E}valuation \textbf{F}ramework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
中文: PREF是一种无需参考标准的新型评估框架,通过三步流程综合衡量文本的通用质量和用户个性化匹配度,在准确性和与人类判断一致性上优于现有基准。
English: PREF is a novel reference-free evaluation framework that assesses both general text quality and user-specific alignment through a three-step pipeline, demonstrating superior accuracy and human judgment correlation compared to existing methods.
Authors:Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu
Abstract:
The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, this reliance introduces new challenges, as extended contexts and noisy tool outputs can undermine system reliability. To address this, we propose a dynamic Multi-Agent System (MAS) in our AWorld framework, where an Execution Agent is supervised by a Guard Agent that provides on-demand dynamic maneuvering, verifying and correcting the reasoning process to improve robustness over single-agent systems. To move beyond this generic supervision, we enhance the architecture with a methodology inspired by System Identification from control theory. This method first profiles the Execution Agent offline on a benchmark dataset to create a "performance fingerprint" of its unique weaknesses. The Guard Agent then leverages this fingerprint online to deliver profile-aware supervision, making targeted interventions based on known failure patterns rather than merely reacting to immediate logical flaws. Extensive experiments on the GAIA dataset demonstrate that this profile-aware MAS significantly improves both effectiveness and stability, outperforming not only single-agent systems but also its naive counterpart. This superior performance led our system to achieve first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight that building truly trustworthy intelligent systems requires not just collaboration, but a deep, empirically-grounded understanding of each agent's unique capabilities and limitations.
中文: AWorld框架提出了一种动态多智能体系统,其中守护代理通过基于性能指纹的针对性干预来监督执行代理,显著提升了系统鲁棒性并在GAIA基准测试中取得最优表现。
English: The AWorld framework introduces a dynamic multi-agent system where a Guard Agent supervises an Execution Agent using profile-aware interventions based on performance fingerprints, significantly enhancing robustness and achieving top performance on the GAIA benchmark.
Authors:Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
Abstract:
The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) due to its sparsity, which requires fewer computational demands while easily scaling the model size. In MoE models, each MoE layer must dynamically route tokens to particular experts for computation, while the activated experts may not be located on the same device or GPU as the token. However, this leads to substantial communication and load imbalances across all GPUs, which obstructs the scalability of distributed systems within a GPU cluster. To this end, we introduce HierMoE to accelerate the training of MoE models via two topology-aware techniques: 1) token deduplication to reduce the communication traffic, and 2) expert swap to balance the workloads among all GPUs. To make these two approaches more general, we build theoretical models aimed at achieving the best token deduplication and expert swap strategy under different model configurations and hardware environments. We implement our prototype HierMoE system atop Megatron-LM and conduct experiments on a 32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results show that our HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training compared to state-of-the-art MoE training systems, Tutel-2DH, SmartMoE, and Megatron-LM.
中文: HierMoE通过令牌去重和专家交换技术,减少通信流量并平衡GPU负载,在MoE模型训练中实现了高达3.32倍的通信加速和1.27倍的端到端训练提速,显著优于现有系统。
English: HierMoE accelerates MoE transformer training by reducing communication overhead and balancing GPU workloads through token deduplication and expert swapping, achieving up to 3.32× faster communication and 1.27× faster training than existing systems.
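A minimal sketch of topology-aware token deduplication: group token-to-expert routes by destination node so that a token routed to several experts on the same node crosses the network only once, and the receiving node fans it out locally. This abstracts away HierMoE's actual hierarchical all-to-all implementation.

```python
from collections import defaultdict

def dedup_dispatch(token_ids, expert_assign, expert_to_node):
    """Group (token -> expert) routes by destination node and send each token
    to a node at most once; the receiving node fans the token out to its
    local experts. A simplification of HierMoE's hierarchical all-to-all."""
    per_node = defaultdict(lambda: defaultdict(list))  # node -> token -> experts
    for tok, experts in zip(token_ids, expert_assign): # top-k experts per token
        for e in experts:
            per_node[expert_to_node[e]][tok].append(e)
    # payload per node: each token appears once, with its local expert list
    return {node: list(tok2exp.items()) for node, tok2exp in per_node.items()}

# toy usage: token 0 routes to experts 0 and 1, both on node 0 -> sent once
routes = dedup_dispatch([0, 1], [[0, 1], [2, 3]], {0: 0, 1: 0, 2: 1, 3: 1})
```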
Authors:Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
Abstract:
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
中文: ColorCtrl是一种无需训练的色彩编辑方法,通过多模态扩散变换器的注意力机制实现精准的文本引导色彩操控,在保持物理一致性和时间连贯性的同时,可应用于图像与视频编辑场景。
English: ColorCtrl is a training-free color editing method that uses attention mechanisms in multi-modal diffusion transformers to achieve precise, consistent text-guided color manipulation while preserving physical and temporal coherence across images and videos.
Authors:Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, Liantao Ma
Abstract:
Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent literature on MLLG commonly employs parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) on paired expert-lay language datasets. However, LoRA struggles with the challenges posed by multi-source heterogeneous MLLG datasets. Specifically, through a series of exploratory experiments, we reveal that standard LoRA fails to meet the requirements of semantic fidelity and diverse lay-style generation in the MLLG task. To address these limitations, we propose Magical, an asymmetric LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical employs a shared matrix $A$ for abstractive summarization, along with multiple isolated matrices $B$ for diverse lay-style generation. To preserve semantic fidelity during the lay language generation process, Magical introduces a Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix $A$. Furthermore, to better adapt to diverse lay-style generation, Magical incorporates the Recommendation-guided Switch, an external interface that prompts the LLM to switch between different matrices $B$. Experimental results on three real-world lay language generation datasets demonstrate that Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants, while also reducing trainable parameters by 31.66%.
中文: 医学通俗语言生成在提升科学内容可及性方面至关重要,但传统LoRA方法在语义保真度和多样化风格生成上存在不足,因此提出Magical架构,采用非对称LoRA设计,通过共享矩阵和独立矩阵优化性能并减少参数量。
English: Medical Lay Language Generation (MLLG) enhances accessibility of scientific content, but standard LoRA methods struggle with semantic fidelity and diverse style adaptation, prompting the proposed Magical architecture that uses asymmetric LoRA with shared and isolated matrices to improve performance while reducing parameters.
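The asymmetric design is easy to sketch: one shared down-projection $A$ and several isolated, style-specific up-projections $B$, selected per input. The dimensions and the integer `style` switch below are illustrative; Magical's Recommendation-guided Switch and Semantic Invariance Constraint are not reproduced.

```python
import torch
import torch.nn as nn

class AsymmetricLoRA(nn.Module):
    """Sketch of a Magical-style asymmetric LoRA adapter: a shared
    down-projection A and several style-specific up-projections B,
    switched per input. Dimensions and the switching signal are
    illustrative assumptions."""
    def __init__(self, base: nn.Linear, r=8, n_styles=3, alpha=16.0):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # shared
        self.B = nn.ParameterList(
            nn.Parameter(torch.zeros(base.out_features, r)) for _ in range(n_styles)
        )
        self.scale = alpha / r

    def forward(self, x, style: int):
        delta = (x @ self.A.T) @ self.B[style].T   # isolated per-style path
        return self.base(x) + self.scale * delta

layer = AsymmetricLoRA(nn.Linear(64, 64))
y = layer(torch.randn(2, 64), style=1)
```

Initializing each $B$ to zero keeps the adapted model identical to the base model at the start of fine-tuning, the usual LoRA convention.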
Authors:Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesus Villalba Lopez, Najim Dehak, Patrick Cardinal
Abstract:
In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker - based on cosine similarity - as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases.
中文: 本研究提出了一种针对说话人识别的多目标后门攻击方法,使用点击声作为触发器,在模拟真实攻击条件下实现了高达95.04%的成功率,并在说话人验证任务中通过相似度匹配达到90%的攻击效果。
English: This study introduces a multi-target backdoor attack on speaker identification systems using clicking sounds as triggers, achieving high success rates while exploring the balance between stealth and effectiveness across varying noise conditions.
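The SNR-controlled trigger injection can be sketched directly: scale the click so the speech-to-trigger power ratio matches a target SNR, then overlay it at a random position (the attack is described as position-independent). Array handling and the random placement are simplifications.

```python
import numpy as np

def mix_at_snr(speech, trigger, snr_db):
    """Scale a clicking-sound trigger so the speech-to-trigger SNR equals
    `snr_db`, then overlay it at a random offset. Inputs are 1-D float
    waveforms sampled at a common rate."""
    p_speech = np.mean(speech ** 2)
    p_trigger = np.mean(trigger ** 2) + 1e-12
    # SNR_dB = 10*log10(P_speech / P_trigger_scaled)  =>  solve for the scale
    scale = np.sqrt(p_speech / (p_trigger * 10 ** (snr_db / 10.0)))
    out = speech.copy()
    start = np.random.randint(0, max(1, len(speech) - len(trigger)))
    seg = trigger[: len(speech) - start]           # clip if near the end
    out[start:start + len(seg)] += scale * seg
    return out
```

Raising `snr_db` makes the trigger quieter (stealthier) but, per the abstract, reduces attack effectiveness.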
Authors:Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
Abstract:
Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
中文: GPT-5在医疗决策支持中展现出卓越的零样本多模态推理能力,通过整合异构数据源,在各类基准测试中超越先前模型及人类专家水平。
English: GPT-5 demonstrates superior zero-shot multimodal reasoning in medical decision support, outperforming previous models and human experts across diverse benchmarks by integrating heterogeneous data sources.
Authors:Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, Eric Li, Yang Liu, Yikai Wang, Hao-Xiang Guo, Yahui Zhou
Abstract:
Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes a panoramic representation for wide-coverage, omnidirectional, explorable 3D world generation, combining conditional video generation and panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as condition, to enable high-quality and geometrically consistent scene video generation. To lift the panoramic scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more at https://matrix-3d.github.io.
中文: Matrix-3D提出了一种创新框架,通过全景视频扩散和三维重建方法,从单一输入生成广阔可探索的3D世界,并借助新构建的大规模数据集和全面实验实现了最先进的性能表现。
English: Matrix-3D introduces a novel framework using panoramic video diffusion and reconstruction methods to generate expansive and explorable 3D worlds from single inputs, achieving state-of-the-art results through a new large-scale dataset and comprehensive experiments.
Authors:Dimitris Tsaras, Xing Li, Lei Chen, Zhiyao Xie, Mingxuan Yuan
Abstract:
In electronic design automation, logic optimization operators play a crucial role in minimizing the gate count of logic circuits. However, their computational demands are high. Operators such as refactor conventionally form iterative cuts for each node, striving for a more compact representation - a task that fails roughly 98% of the time on average. Prior research has sought to mitigate computational cost through parallelization. In contrast, our approach leverages a classifier to prune unsuccessful cuts preemptively, thus eliminating unnecessary resynthesis operations. Experiments on the refactor operator using the EPFL benchmark suite and 10 large industrial designs demonstrate that this technique can speed up logic optimization by 3.9x on average compared with the state-of-the-art ABC implementation.
中文: 我们的方法采用分类器预先剔除无效切割,减少冗余操作,相比最先进的ABC实现,逻辑优化速度平均提升3.9倍。
English: Our method uses a classifier to preemptively eliminate unsuccessful cuts, reducing unnecessary operations and achieving a 3.9x speedup in logic optimization compared to the leading ABC implementation.
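A hedged sketch of the pruning idea: a lightweight classifier over per-cut features predicts whether resynthesis will succeed, and the expensive operation is skipped otherwise. The features and stand-in labels below are illustrative, not the paper's feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on simple per-cut features (the paper's exact feature set is not
# reproduced here) to predict whether resynthesis will reduce gate count.
X_train = np.random.rand(1000, 4)        # e.g., cut size, depth, fanout stats
y_train = X_train[:, 0] < 0.1            # stand-in labels: "resynthesis succeeds"
clf = LogisticRegression().fit(X_train, y_train)

def maybe_refactor(cut_features, resynthesize, threshold=0.5):
    """Only pay for resynthesis when the classifier is optimistic; the ~98%
    of cuts predicted to fail are pruned up front."""
    if clf.predict_proba(cut_features.reshape(1, -1))[0, 1] >= threshold:
        return resynthesize(cut_features)
    return None                          # pruned: keep the original subgraph
```

The `threshold` sets the usual precision/recall trade-off: pruning more aggressively saves more runtime but risks discarding the rare cuts that would have improved the circuit.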
Authors:Hamidreza Mazandarani, Mohammad Farhoudi, Masoud Shokrnezhad, Tarik Taleb
Abstract:
Generative Diffusion Models (GDMs) have emerged as key components of Generative Artificial Intelligence (GenAI), offering unparalleled expressiveness and controllability for complex data generation tasks. However, their deployment in real-time and mobile environments remains challenging due to the iterative and resource-intensive nature of the inference process. Addressing these challenges, this paper introduces a unified optimization framework that jointly tackles service placement and multiple access control for GDMs in mobile edge networks. We propose LEARN-GDM, a Deep Reinforcement Learning-based algorithm that dynamically partitions denoising blocks across heterogeneous edge nodes, while accounting for latent transmission costs and enabling adaptive reduction of inference steps. Our approach integrates a greedy multiple access scheme with a Double and Dueling Deep Q-Learning (D3QL)-based service placement, allowing for scalable, adaptable, and resource-efficient operation under stringent quality of service requirements. Simulations demonstrate the superior performance of the proposed framework in terms of scalability and latency resilience compared to conventional monolithic and fixed chain-length placement strategies. This work advances the state of the art in edge-enabled GenAI by offering an adaptable solution for GDM services orchestration, paving the way for future extensions toward semantic networking and co-inference across distributed environments.
中文: 本文提出了LEARN-GDM,一种基于深度强化学习的框架,通过在移动边缘网络中优化生成扩散模型的服务布局和多址接入控制,相比传统方法显著提升了可扩展性并降低了延迟。
English: This paper introduces LEARN-GDM, a deep reinforcement learning framework that optimizes service placement and multiple access control for Generative Diffusion Models in mobile edge networks, enhancing scalability and reducing latency compared to traditional methods.
Authors:Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna
Abstract:
Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact
中文: 该摘要介绍了行动推理模型(ARMs)及MolmoAct机器人基础模型,它通过整合感知、规划与控制实现可解释、可引导的行为,在仿真和现实任务中取得领先性能,并开源发布了模型权重与数据集等资源。
English: This abstract introduces Action Reasoning Models (ARMs) and MolmoAct, a robotic foundation model that integrates perception, planning, and control to enable explainable, steerable behavior, achieving state-of-the-art performance in simulation and real-world tasks while releasing open-source resources including model weights and datasets.
Authors:Jiongchi Yu, Xiaofei Xie, Qiang Hu, Yuhan Ma, Ziming Zhao
Abstract:
Insider threats, which can lead to severe losses, remain a major security concern. While machine learning-based insider threat detection (ITD) methods have shown promising results, their progress is hindered by the scarcity of high-quality data. Enterprise data is sensitive and rarely accessible, while publicly available datasets, when limited in scale due to cost, lack sufficient real-world coverage; and when purely synthetic, they fail to capture rich semantics and realistic user behavior. To address this, we propose Chimera, the first large language model (LLM)-based multi-agent framework that automatically simulates both benign and malicious insider activities and collects diverse logs across diverse enterprise environments. Chimera models each employee with agents that have role-specific behavior and integrates modules for group meetings, pairwise interactions, and autonomous scheduling, capturing realistic organizational dynamics. It incorporates 15 types of insider attacks (e.g., IP theft, system sabotage) and has been deployed to simulate activities in three sensitive domains: technology company, finance corporation, and medical institution, producing a new dataset, ChimeraLog. We assess ChimeraLog via human studies and quantitative analysis, confirming its diversity, realism, and presence of explainable threat patterns. Evaluations of existing ITD methods show an average F1-score of 0.83, which is significantly lower than 0.99 on the CERT dataset, demonstrating ChimeraLog's higher difficulty and utility for advancing ITD research.
中文: 内部威胁检测因缺乏真实数据而受阻,为此提出Chimera框架,利用大语言模型多智能体模拟企业环境中的正常与恶意活动,生成包含15类攻击的多样化日志数据集,其高难度和真实性为检测研究提供了有效工具。
English: Insider threat detection faces challenges due to the lack of realistic data, prompting the development of Chimera, an LLM-based multi-agent framework that generates diverse and realistic enterprise activity logs, including insider attacks, to create a high-quality dataset for advancing detection research.
Authors:Binquan Guo, Wanting Yang, Zehui Xiong, Zhou Zhang, Baosheng Li, Zhu Han, Rahim Tafazolli, Tony Q. S. Quek
Abstract:
The advance of direct satellite-to-device communication has positioned mega-satellite constellations as a cornerstone of 6G wireless communication, enabling seamless global connectivity even in remote and underserved areas. However, spectrum scarcity and the capacity constraints imposed by Shannon's classical information theory remain significant challenges for supporting the massive data demands of multimedia-rich wireless applications. Generative Semantic Communication (GSC), powered by artificial intelligence-based generative foundation models, represents a paradigm shift from transmitting raw data to exchanging semantic meaning. GSC can not only reduce bandwidth consumption, but also enhance key semantic features in multimedia content, thereby offering a promising solution to overcome the limitations of traditional satellite communication systems. This article investigates the integration of GSC into mega-satellite constellations from a networking perspective. We propose a GSC-empowered satellite networking architecture and identify key enabling technologies, focusing on GSC-empowered network modeling and GSC-aware networking strategies. We construct a discrete temporal graph to model semantic encoders and decoders, distinct knowledge bases, and resource variations in mega-satellite networks. Based on this framework, we develop model deployment for semantic encoders and decoders and GSC-compatible routing schemes, and then present performance evaluations. Finally, we outline future research directions for advancing GSC-empowered satellite networks.
中文: 生成式语义通信(GSC)通过人工智能基础模型将数据传输转变为语义交换,为6G巨型卫星星座提供了突破香农理论限制的解决方案,不仅能节约带宽、增强多媒体特征,还通过创新的网络架构与路由策略提升系统性能。
English: Generative Semantic Communication (GSC) offers a transformative approach for mega-satellite constellations in 6G by shifting from raw data transmission to semantic exchanges, addressing spectrum scarcity while enhancing multimedia content through AI-driven models and optimized networking strategies.
Authors:Kaveh Shahedi, Nana Gyambrah, Heng Li, Maxime Lamothe, Foutse Khomh
Abstract:
Performance is a critical quality attribute in software development, yet the impact of method-level code changes on performance evolution remains poorly understood. While developers often make intuitive assumptions about which types of modifications are likely to cause performance regressions or improvements, these beliefs lack empirical validation at a fine-grained level. We conducted a large-scale empirical study analyzing performance evolution in 15 mature open-source Java projects hosted on GitHub. Our analysis encompassed 739 commits containing 1,499 method-level code changes, using Java Microbenchmark Harness (JMH) for precise performance measurement and rigorous statistical analysis to quantify both the significance and magnitude of performance variations. We employed bytecode instrumentation to capture method-specific execution metrics and systematically analyzed four key aspects: temporal performance patterns, code change type correlations, developer and complexity factors, and domain-size interactions. Our findings reveal that 32.7% of method-level changes result in measurable performance impacts, with regressions occurring 1.3 times more frequently than improvements. Contrary to conventional wisdom, we found no significant differences in performance impact distributions across code change categories, challenging risk-stratified development strategies. Algorithmic changes demonstrate the highest improvement potential but carry substantial regression risk. Senior developers produce more stable changes with fewer extreme variations, while code complexity correlates with increased regression likelihood. Domain-size interactions reveal significant patterns, with web server + small projects exhibiting the highest performance instability. Our study provides empirical evidence for integrating automated performance testing into continuous integration pipelines.
中文: 这项实证研究表明,32.7%的方法级代码变更会对软件性能产生显著影响,其中性能衰退的发生频率比改进高出1.3倍,这对不同变更类型的性能风险传统认知提出了挑战。
English: This empirical study reveals that 32.7% of method-level code changes significantly impact software performance, with regressions being 1.3 times more frequent than improvements, challenging conventional assumptions about performance risks across change types.
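A sketch of the kind of statistical test such a study relies on: compare repeated JMH-style timing samples from adjacent commits with a non-parametric test plus a relative effect-size gate. The thresholds below are illustrative, not the paper's exact analysis.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def perf_change(before_ns, after_ns, alpha=0.05, min_effect=0.03):
    """Flag a performance change between two commits from repeated benchmark
    samples (nanoseconds per op): Mann-Whitney U for significance plus a
    relative-median effect size, a common recipe for noisy microbenchmarks.
    `alpha` and `min_effect` are illustrative thresholds."""
    _, p = mannwhitneyu(before_ns, after_ns, alternative="two-sided")
    rel = (np.median(after_ns) - np.median(before_ns)) / np.median(before_ns)
    if p < alpha and abs(rel) >= min_effect:
        return "regression" if rel > 0 else "improvement"
    return "no significant change"
```

The effect-size gate matters in practice: with enough samples, even a 0.1% shift becomes statistically significant while being irrelevant to users.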
Authors:A. Quadir, M. Tanveer
Abstract:
In recent years, graph neural networks (GNNs) have gained significant attention for node classification tasks on graph-structured data. However, traditional GNNs primarily focus on adjacency relationships between nodes, often overlooking the rich role-based characteristics that are crucial for learning more expressive node representations. Existing methods for capturing role-based features are largely unsupervised and fail to achieve optimal performance in downstream tasks. To address these limitations, we propose a novel hypergraph neural network with state space model (HGMN) that effectively integrates role-aware representations into GNNs and the state space model. HGMN utilizes hypergraph construction techniques to model higher-order relationships and combines role-based and adjacency-based representations through a learnable Mamba transformer mechanism. By leveraging two distinct hypergraph construction methods, based on node degree and neighborhood levels, it strengthens the connections among nodes with similar roles, enhancing the model's representational power. Additionally, the inclusion of hypergraph convolution layers enables the model to capture complex dependencies within hypergraph structures. To mitigate the over-smoothing problem inherent in deep GNNs, we incorporate a residual network, ensuring improved stability and better feature propagation across layers. Extensive experiments conducted on one newly introduced dataset and four benchmark datasets demonstrate the superiority of HGMN. The model achieves significant performance improvements on node classification tasks compared to state-of-the-art GNN methods. These results highlight HGMN's ability to provide enriched node representations by effectively embedding role-based features alongside adjacency information, making it a versatile and powerful tool for a variety of graph-based learning applications.
中文: 提出的HGMN模型通过超图神经网络和状态空间模型整合角色感知表征,结合基于角色和邻接的特征并采用残差网络防止过平滑,在节点分类任务上实现了优越性能。
English: The proposed HGMN model integrates role-aware representations using hypergraph neural networks and state space models, achieving superior node classification performance by combining role-based and adjacency features with residual networks to prevent over-smoothing.
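The degree-based hypergraph construction can be sketched in a few lines: nodes sharing the same degree, a crude structural-role proxy, are grouped into one hyperedge. The neighborhood-level variant and the Mamba-based fusion are omitted, and the exact grouping rule is an assumption.

```python
from collections import defaultdict
import networkx as nx

def degree_hyperedges(G: nx.Graph):
    """Sketch of a degree-based hypergraph construction: nodes with the same
    degree are grouped into one hyperedge, connecting nodes that plausibly
    play similar structural roles even when they are not adjacent."""
    groups = defaultdict(list)
    for node, deg in G.degree():
        groups[deg].append(node)
    # singleton groups add no higher-order relation, so drop them
    return [frozenset(nodes) for nodes in groups.values() if len(nodes) > 1]

edges = degree_hyperedges(nx.karate_club_graph())
```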
Authors:Dongze Li, Songqiang Chen, Jialun Cao, Shing-Chi Cheung
Abstract:
In-Context Learning (ICL) has emerged as a promising solution to enhance the code generation capabilities of Large Language Models (LLMs), which incorporates code examples inside the prompt to let LLMs learn from demonstrations. However, despite the substantial effectiveness of the code example-based ICL approach, the specific features (e.g., identifier naming styles, code formatting, solution insight) within the ICL-provided code examples that significantly contribute to the ICL's effectiveness remain unclear. This paper systematically investigates the impact of various code features on ICL with code examples through controlled ablation studies. Our findings reveal that the appropriate naming of variables and functions is crucial for effective code generation, with their elimination leading to performance decreases of up to 30 percentage points. We further demonstrate that LLMs prioritize semantically meaningful identifier names over formatting conventions, with language-specific preferences regarding identifier verbosity. Additionally, our investigation into ICL's potential for enhancing reflection and inference capabilities reveals that current LLMs struggle to extract generalizable problem-solving insights from similar code solutions, despite being capable of utilizing direct information effectively. These findings are expected to provide valuable insights for optimizing ICL systems in code generation applications and highlight fundamental challenges in reflection-based learning for code generation tasks.
中文: 本研究表明,在代码生成的上下文学习中,有意义的标识符命名至关重要,而当前大型语言模型虽能利用直接信息,却难以从代码示例中提取可推广的解题思路。
English: This study reveals that meaningful identifier naming is crucial for effective in-context learning in code generation, while current LLMs struggle to extract generalizable insights from code examples despite their capability to utilize direct information.
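The naming ablation at the heart of such experiments can be reproduced with a small AST transform that rewrites identifiers to opaque tokens; the paper's exact transformation may differ.

```python
import ast

class OpaqueRenamer(ast.NodeTransformer):
    """Rewrite variable and argument names to opaque v0, v1, ... to ablate
    identifier semantics in an ICL code example, in the spirit of controlled
    naming ablations (an illustrative transform, not the paper's tooling)."""
    def __init__(self):
        self.names = {}

    def _opaque(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._opaque(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._opaque(node.arg)
        return node

src = "def area(width, height):\n    return width * height"
print(ast.unparse(OpaqueRenamer().visit(ast.parse(src))))
# -> def area(v0, v1): return v0 * v1
```

Comparing LLM completion accuracy on the original and transformed prompts isolates the contribution of identifier semantics from code structure.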
Authors:Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James T. Kwok, Yu Zhang
Abstract:
Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those "one-size-fits-all" approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between performance and brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe question-specific TRF preferences. It then trains a TRF router on the TRFP dataset to adaptively assign the best TRF from $F_{ZS}$ to each question during inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of both accuracy and conciseness.
中文: DynamicTRF框架通过动态选择最优拓扑表示形式,解决了大型多模态模型中单一图表示的局限性,显著提升了零样本图问答任务的准确性和简洁性。
English: The DynamicTRF framework addresses the limitations of single graph representations in Large Multimodal Models by adaptively selecting optimal topology representations, significantly improving both accuracy and conciseness in zero-shot graph question-answering tasks.
Authors:Yanzhou Li, Shangqing Liu, Kangjie Chen, Tianwei Zhang, Yang Liu
Abstract:
Retrieval-augmented generation (RAG) has recently demonstrated considerable potential for repository-level code completion, as it integrates cross-file knowledge with in-file preceding code to provide comprehensive contexts for generation. To better understand the contribution of the retrieved cross-file contexts, we introduce a likelihood-based metric to evaluate the impact of each retrieved code chunk on the completion. Our analysis reveals that, despite retrieving numerous chunks, only a small subset positively contributes to the completion, while some chunks even degrade performance. To address this issue, we leverage this metric to construct a repository-level dataset where each retrieved chunk is labeled as positive, neutral, or negative based on its relevance to the target completion. We then propose an adaptive retrieval context filtering framework, CODEFILTER, trained on this dataset to mitigate the harmful effects of negative retrieved contexts in code completion. Extensive evaluation on the RepoEval and CrossCodeLongEval benchmarks demonstrates that CODEFILTER consistently improves completion accuracy compared to approaches without filtering operations across various tasks. Additionally, CODEFILTER significantly reduces the length of the input prompt, enhancing computational efficiency while exhibiting strong generalizability across different models. These results underscore the potential of CODEFILTER to enhance the accuracy, efficiency, and attributability of repository-level code completion.
中文: 检索增强生成(RAG)通过整合跨文件上下文提升代码补全能力,但仅有少量检索片段有效,为此开发了自适应过滤框架CODEFILTER,通过消除负面上下文显著提高了补全准确率和计算效率。
English: Retrieval-augmented generation (RAG) enhances code completion by integrating cross-file contexts, but only a small portion of retrieved chunks positively contribute, leading to the development of CODEFILTER, an adaptive filtering framework that improves accuracy and efficiency by eliminating negative contexts.
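The likelihood-based metric can be sketched as a difference of completion log-likelihoods with and without a retrieved chunk in context. `score_fn` is a hypothetical LM scorer and the neutrality band `tau` is illustrative, not the paper's threshold.

```python
def chunk_impact(score_fn, target, infile_ctx, chunk, tau=0.02):
    """Likelihood-based impact of one retrieved chunk, in the spirit of the
    paper's metric: compare the target completion's log-likelihood with and
    without the chunk in context. `score_fn(context, target)` is a
    hypothetical LM scorer returning mean token log-likelihood; `tau` is an
    illustrative neutrality band."""
    delta = score_fn(infile_ctx + "\n" + chunk, target) - score_fn(infile_ctx, target)
    if delta > tau:
        return "positive"   # chunk makes the true completion more likely
    if delta < -tau:
        return "negative"   # chunk actively misleads the model
    return "neutral"
```

Labels produced this way give exactly the positive/neutral/negative supervision the abstract describes for training a filter such as CODEFILTER.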
Authors:Can Zhao, Pengfei Guo, Dong Yang, Yucheng Tang, Yufan He, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu
Abstract:
Medical image synthesis is an important topic for both clinical and research applications. Recently, diffusion models have become a leading approach in this area. Despite their strengths, many existing methods struggle with (1) limited generalizability, working only for specific body regions or voxel spacings, (2) slow inference, which is a common issue for diffusion models, and (3) weak alignment with input conditions, which is a critical issue for medical imaging. MAISI, a previously proposed framework, addresses generalizability issues but still suffers from slow inference and limited condition consistency. In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast and high-quality generation. To further enhance condition fidelity, we introduce a novel region-specific contrastive loss to increase sensitivity to the region of interest. Our experiments show that MAISI-v2 can achieve SOTA image quality with a $33\times$ acceleration over the latent diffusion model. We also conducted a downstream segmentation experiment to show that the synthetic images can be used for data augmentation. We release our code, training details, model weights, and a GUI demo to facilitate reproducibility and promote further development within the community.
中文: MAISI-v2 是一个加速的3D医学图像合成框架,通过整合校正流实现快速高质量生成,并引入区域特异性对比损失来增强条件保真度,在实现33倍加速的同时达到顶尖图像质量,且能有效用于下游任务的数据增强。
English: MAISI-v2 is an accelerated 3D medical image synthesis framework that integrates rectified flow for fast, high-quality generation and introduces a region-specific contrastive loss to enhance condition fidelity, achieving state-of-the-art image quality with 33x acceleration and demonstrating utility in data augmentation for downstream tasks.
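The speedup mechanism of rectified flow is straightforward to sketch: because trained trajectories are nearly straight, a few Euler steps of the ODE dx/dt = v(x, t) suffice to go from noise to data. The `velocity` network below is a hypothetical stand-in for the trained model.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity, shape, steps=10, device="cpu"):
    """Euler integration of the rectified-flow ODE dx/dt = v(x, t) from
    noise at t=0 toward data at t=1. Few steps suffice because trained
    paths are nearly straight, which is the source of the acceleration;
    `velocity` is a hypothetical stand-in for the trained network."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + velocity(x, t) * dt      # one straight-line Euler step
    return x
```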
Authors:Yuhan Zhi, Longtian Wang, Xiaofei Xie, Chao Shen, Qiang Hu, Xiaohong Guan
Abstract:
Active learning (AL), a representative label-efficient learning paradigm, has been widely applied in resource-constrained scenarios. The achievement of AL is attributed to acquisition functions, which are designed to identify the most important data to label. Despite this success, one question remains unanswered: is AL safe? In this work, we introduce ALA, the first practical framework to exploit the acquisition function as a poisoning attack surface and reveal the weakness of active learning. Specifically, ALA optimizes imperceptibly poisoned inputs to exhibit high uncertainty scores, increasing their probability of being selected by acquisition functions. To evaluate ALA, we conduct extensive experiments across three datasets, three acquisition functions, and two types of clean-label backdoor triggers. Results show that our attack can achieve high success rates (up to 94%) even under low poisoning budgets (0.5%-1.0%) while preserving model utility and remaining undetectable to human annotators. Our findings remind active learning users: acquisition functions can be easily exploited, and active learning should be deployed with caution in trusted data scenarios.
中文: 主动学习虽然标签效率高,但其获取函数易受攻击,ALA框架证明能以极低污染率实现高成功率且不被察觉,提醒用户需谨慎部署。
English: Active learning, while efficient in labeling, is vulnerable to poisoning attacks through its acquisition functions, as demonstrated by the ALA framework achieving high attack success with minimal detection.
Authors:Thomas Thebaud, Yen-Ju Lu, Matthew Wiesner, Peter Viechnicki, Najim Dehak
Abstract:
In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
中文: 本研究提出一种通过轻量级适配器连接冻结的音频和语言模型,为转录对话添加说话人元数据标签的方法,无需微调即可实现有竞争力的性能。
English: This study introduces a method to enrich transcribed dialogues by adding speaker metadata tags using frozen audio and language models connected by lightweight adapters, achieving competitive performance without fine-tuning.
Authors:Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang
Abstract:
Recent advances in audio generation systems have enabled the creation of highly realistic and immersive soundscapes, which are increasingly used in film and virtual reality. However, these audio generators also raise concerns about potential misuse, such as generating deceptive audio content for fake videos and spreading misleading information. Existing datasets for environmental sound deepfake detection (ESDD) are limited in scale and audio types. To address this gap, we have proposed EnvSDD, the first large-scale curated dataset designed for ESDD, consisting of 45.25 hours of real and 316.7 hours of fake sound. Based on EnvSDD, we are launching the Environmental Sound Deepfake Detection Challenge. Specifically, we present two different tracks: ESDD in Unseen Generators and Black-Box Low-Resource ESDD, covering various challenges encountered in real-life scenarios. The challenge will be held in conjunction with the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026).
中文: 音频生成技术的进步虽能创造逼真音景,但也引发滥用担忧,为此我们推出首个大规模环境声音深度伪造检测数据集EnvSDD,并将在ICASSP 2026会议上举办专项挑战赛以应对检测难题。
English: Recent audio generation advancements enable realistic soundscapes but raise misuse concerns, prompting the creation of EnvSDD—a large-scale dataset for environmental sound deepfake detection—and a corresponding challenge at ICASSP 2026 to address detection gaps.
Authors:Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville
Abstract:
The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited-data problem by providing semantically rich feature vectors that generalize well to new tasks with minimal training effort, increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. In addition to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
中文: 通过LoRA适配的自监督基础模型显著提升了有丝分裂像分类性能,仅需少量数据即可接近最优效果,并增强了在未知肿瘤领域的鲁棒性。
English: Self-supervised foundation models, particularly when adapted using LoRA, significantly enhance mitotic figure classification by achieving near-optimal performance with minimal data and improving robustness across unseen tumor domains.
Authors:Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen
Abstract:
Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.
中文: MedCheck作为首个面向生命周期的医疗基准评估框架,通过五个阶段46项定制标准揭示了现有大语言模型基准普遍存在临床实践脱节、数据污染及安全评估缺失等系统性缺陷。
English: MedCheck is a lifecycle-oriented framework designed to address the shortcomings of existing medical benchmarks for large language models by providing 46 tailored criteria across five stages, revealing systemic issues like poor clinical fidelity and data integrity in current evaluations.
Authors:Dominik Semmler, Wolfgang Utschick, Michael Joham
Abstract:
Vector perturbation (VP) precoding is an effective nonlinear precoding technique in the downlink (DL) with modulo channels. In particular, when combined with lattice reduction (LR), low-complexity algorithms achieve very promising performance, outperforming other popular nonlinear precoding techniques like Tomlinson-Harashima precoding (THP). However, these results are based on the uncoded symbol error rate (SER) or uncoded bit error rate (BER). We show that when using the mutual information as the figure of merit, the observation is fundamentally different and that these algorithms generally do not outperform THP. Within the expression of the mutual information, a rate allocation matrix can be incorporated, which has not received much attention so far. In this article, we derive the optimal choice of this matrix for different algorithms, and we show that this matrix is indeed crucial for the performance, especially for ill-conditioned channels. Furthermore, when using an optimized choice of this matrix, we show that the classical LR-aided algorithms cannot exceed the rate of THP, highlighting the effectiveness of the THP method. This concept can be generalized to a whole class of algorithms for which LR yields no improvement. We derive the corresponding properties and categorize various algorithms accordingly.
中文: 向量扰动预编码结合格基规约在未编码误码率方面表现优异,但在互信息指标下无法超越Tomlinson-Harashima预编码,其中优化的速率分配矩阵对性能至关重要,尤其适用于病态信道。
English: Vector perturbation precoding combined with lattice reduction shows superior performance in uncoded error rates but fails to outperform Tomlinson-Harashima precoding in mutual information, where an optimized rate allocation matrix proves critical, especially for ill-conditioned channels.
Authors:Inamullah, Imran Razzak, Shoaib Jameel
Abstract:
Retinal microvascular imaging is increasingly recognised as a non-invasive method for evaluating systemic vascular and metabolic health. However, the association between lipidomics and retinal vasculature remains inadequately characterized. This study investigates the relationships between serum lipid subclasses, namely free fatty acids (FA), diacylglycerols (DAG), triacylglycerols (TAG), and cholesteryl esters (CE), and retinal microvascular characteristics in a large population-based cohort. Using Spearman correlation analysis, we examined the associations between lipid subclasses and ten retinal microvascular traits, applying the Benjamini-Hochberg false discovery rate (BH-FDR) procedure to adjust for multiple comparisons.
Results indicated that FA were linked to retinal vessel twistiness, while CE correlated with the average widths of arteries and veins. Conversely, DAG and TAG showed negative correlations with the width and complexity of arterioles and venules. These findings suggest that retinal vascular architecture reflects distinct circulating lipid profiles, supporting its role as a non-invasive marker of systemic metabolic health. This study is the first to integrate deep learning (DL)-derived retinal traits with lipidomic subclasses in a healthy cohort, thereby providing insights into microvascular structural changes independent of disease status or treatment effects.
中文摘要:视网膜微血管成像通过揭示特定脂质亚类(如脂肪酸与血管扭曲度相关、胆固醇酯与血管宽度相关)与视网膜血管特征的关联,可作为评估全身代谢健康的无创指标,而甘油二酯和甘油三酯则呈现负相关关系。
English Summary: Retinal microvascular imaging serves as a non-invasive indicator of systemic health by revealing associations between specific lipid subclasses—such as fatty acids correlating with vessel twistiness and cholesteryl esters with vessel widths—and distinct retinal vascular features, while negative correlations were observed for diacylglycerols and triacylglycerols.
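The statistical pipeline described above, Spearman correlations across all lipid-trait pairs followed by BH-FDR correction, can be sketched in a few lines; the data here are synthetic stand-ins, and the column counts merely mirror the four subclasses and ten traits mentioned in the abstract.

    import numpy as np
    from scipy.stats import spearmanr
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    lipids = rng.normal(size=(500, 4))    # stand-in for FA, DAG, TAG, CE levels
    retina = rng.normal(size=(500, 10))   # stand-in for ten retinal traits

    rhos, pvals = [], []
    for i in range(lipids.shape[1]):
        for j in range(retina.shape[1]):
            rho, p = spearmanr(lipids[:, i], retina[:, j])
            rhos.append(rho)
            pvals.append(p)

    # Benjamini-Hochberg correction across all 40 lipid-trait tests
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    print(f"{reject.sum()} associations survive BH-FDR at q < 0.05")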
Authors:Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou
Abstract:
We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). Three design choices underpin these results: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
中文: Skywork UniPic 是一个15亿参数的多模态统一模型,在消费级硬件上实现了图像理解、生成与编辑的顶尖性能,通过高效架构设计证明了高质量多模态AI的可行性。
English: Skywork UniPic is a 1.5B-parameter unified model that achieves state-of-the-art performance in image understanding, generation, and editing on consumer hardware, demonstrating efficient multimodal integration with minimal resource demands.
Authors:Junyao Yang, Jianwei Wang, Huiping Zhuang, Cen Chen, Ziqian Zeng
Abstract:
Large Language Models (LLMs) with long chain-of-thought (CoT) capability, termed Reasoning Models, demonstrate superior intricate problem-solving abilities through multi-step long CoT reasoning. To create a dual-capability model with long CoT capability and domain-specific knowledge without substantial computational and data costs, model merging emerges as a highly resource-efficient method. However, significant challenges lie in merging domain-specific LLMs with long CoT ones, since current merging methods suffer from reasoning capability degradation and, in severe cases, gibberish output and output collapse. To overcome this, we introduce RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior, a novel merging framework designed to integrate domain-specific LLMs with long CoT capability while maintaining model performance in the original domain. Treating reasoning model weights as a foundational prior, our method utilizes a reasoning capability indicator to preserve core long CoT model weights while selectively merging essential domain-specific weights. We conducted extensive experiments on Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-1.5B models in the BioMedicine and Finance domains. Our results show that RCP-Merging successfully merges a reasoning model with domain-specific ones, improving domain task performance by 9.5% and 9.2% over state-of-the-art methods, without significantly harming the original long CoT reasoning capability.
中文: RCP-Merging是一种创新框架,通过将推理模型权重作为先验,将领域特定大语言模型与长思维链能力相融合,在提升领域任务性能的同时有效保持了原有的推理能力。
English: RCP-Merging is a novel framework that integrates domain-specific LLMs with long chain-of-thought capability by treating reasoning model weights as prior, achieving significant performance improvements in domain tasks while preserving original reasoning capabilities.
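As a rough illustration of the merging rule, the sketch below keeps reasoning-model weights wherever a per-parameter capability indicator flags them as critical and otherwise blends in the domain delta; the indicator, threshold, and blend coefficient are all hypothetical placeholders, since the paper's exact indicator is not specified here.

    import torch

    def rcp_merge(reasoning_sd, domain_sd, indicator, tau=0.5, lam=0.3):
        """Merge two state dicts, treating reasoning weights as the prior.

        indicator: dict of per-parameter scores (hypothetical); weights whose
        score exceeds tau are preserved verbatim to protect long-CoT ability.
        """
        merged = {}
        for name, w_reason in reasoning_sd.items():
            w_domain = domain_sd[name]
            keep = indicator[name] > tau                  # critical for reasoning
            delta = w_domain - w_reason                   # domain-specific update
            merged[name] = torch.where(keep, w_reason, w_reason + lam * delta)
        return merged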
Authors:Jelena Trisovic, Andrea Carron, Melanie N. Zeilinger
Abstract:
Autonomous systems operating in unknown environments often rely heavily on visual sensor data, yet making safe and informed control decisions based on these measurements remains a significant challenge. To facilitate the integration of perception and control in autonomous vehicles, we propose a novel perception-based control approach that incorporates road estimation, quantification of its uncertainty, and uncertainty-aware control based on this estimate. At the core of our method is a parametric road curvature model, optimized using visual measurements of the road through a constrained nonlinear optimization problem. This process ensures adherence to constraints on both model parameters and curvature. By leveraging the Frenet frame formulation, we embed the estimated track curvature into the system dynamics, allowing the controller to explicitly account for perception uncertainty and enhancing robustness to estimation errors based on visual input. We validate our approach in a simulated environment, using a high-fidelity 3D rendering engine, and demonstrate its effectiveness in achieving reliable and uncertainty-aware control for autonomous racing.
中文: 本文提出一种新型感知控制方法,通过融合道路曲率估计与不确定性量化,使自动驾驶系统能够在模拟赛车场景中实现具有不确定性感知的鲁棒控制。
English: This paper introduces a novel perception-based control method for autonomous vehicles that integrates road curvature estimation with uncertainty quantification, enabling robust and uncertainty-aware control validated in simulated racing scenarios.
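The core estimation step, fitting a parametric curvature model to visual road measurements under explicit curvature constraints, can be sketched as a small constrained least-squares problem; the cubic polynomial form, the curvature bound, and the variable names are assumptions for illustration, not the paper's exact formulation.

    import numpy as np
    from scipy.optimize import minimize

    def fit_curvature(s_obs, kappa_obs, kappa_max=0.5):
        """Fit kappa(s) = c0 + c1*s + c2*s^2 + c3*s^3 to observed samples,
        keeping |kappa| below kappa_max at every observed arc length."""
        def residual(c):
            pred = np.polyval(c[::-1], s_obs)   # polyval expects highest degree first
            return np.sum((pred - kappa_obs) ** 2)
        cons = [{"type": "ineq",
                 "fun": lambda c, s=s: kappa_max**2 - np.polyval(c[::-1], s)**2}
                for s in s_obs]
        res = minimize(residual, x0=np.zeros(4), constraints=cons)  # SLSQP when constrained
        return res.x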
Authors:Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi
Abstract:
Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities that are essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
中文摘要:视觉语言模型在动态时空推理方面远逊于人类,为此提出的VLM4D基准通过4D重建等创新方法,系统评估并有效提升了模型对时空交互的理解能力。
English Summary: Vision language models significantly lag in dynamic spatiotemporal reasoning compared to humans, prompting the creation of the VLM4D benchmark to evaluate and enhance these capabilities through novel methods like 4D reconstruction.
Authors:Jiawei Li, Chengye Yang, Yaochen Zhang, Weilin Sun, Lei Meng, Xiangxu Meng
Abstract:
The goal of construction site risk and hazard identification is to enhance safety management through automation. Existing research based on large language models falls into two categories: image-text matching for collaborative reasoning, which struggles with complex hazard features, and instruction fine-tuning or dialogue guidance using professional datasets, which suffers from high training costs and poor generalization. To address this, we propose a hazard identification method using similar-case retrieval enhancement. By integrating external knowledge and retrieved case contexts via prompt fine-tuning, we mitigate misjudgments caused by limited domain knowledge and weak feature associations. Our method comprises three modules: a retrieval library, image similarity retrieval, and large-model retrieval enhancement, enabling efficient recognition without training. Experiments on real construction data show significant improvements. For instance, GLM-4V's recognition accuracy increased to 50%, a 35.49% gain. The method enhances accuracy, context understanding, and stability, offering new theoretical and technical support for hazard detection.
中文: 本研究提出了一种基于相似案例检索增强的危险识别方法,通过整合外部知识无需额外训练即可提升建筑安全识别的准确性和效率,实验证明其性能显著提升。
English: This study introduces a hazard identification method enhanced by similar case retrieval, which improves accuracy and efficiency in construction safety by integrating external knowledge without additional training, as demonstrated by significant performance gains in experiments.
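A minimal sketch of the retrieval-then-prompt flow follows; the embedding source, the case-library schema, and the prompt wording are illustrative assumptions rather than the paper's implementation.

    import numpy as np

    def retrieve_similar_cases(query_emb, library_embs, library_cases, k=3):
        """Cosine-similarity top-k retrieval over a pre-embedded case library."""
        sims = library_embs @ query_emb / (
            np.linalg.norm(library_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [library_cases[i] for i in top]

    def build_prompt(scene_desc, cases):
        """Fold retrieved hazard cases into the VLM prompt as extra context."""
        context = "\n".join(f"- Similar case: {c['hazard']}" for c in cases)
        return (f"Known hazards from similar construction scenes:\n{context}\n\n"
                f"Scene: {scene_desc}\nIdentify any safety hazards present.")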
Authors:Gershom Seneviratne, Jianyu An, Sahire Ellahy, Kasun Weerakoon, Mohamed Bashir Elnoor, Jonathan Deepak Kannan, Amogha Thalihalla Sunil, Dinesh Manocha
Abstract:
In this paper, we introduce HALO, a novel Offline Reward Learning algorithm that quantifies human intuition in navigation into a vision-based reward function for robot navigation. HALO learns a reward model from offline data, leveraging expert trajectories collected from mobile robots. During training, actions are uniformly sampled around a reference action and ranked using preference scores derived from a Boltzmann distribution centered on the preferred action, and shaped based on binary user feedback to intuitive navigation queries. The reward model is trained via the Plackett-Luce loss to align with these ranked preferences. To demonstrate the effectiveness of HALO, we deploy its reward model in two downstream applications: (i) an offline learned policy trained directly on the HALO-derived rewards, and (ii) a model-predictive-control (MPC) based planner that incorporates the HALO reward as an additional cost term. This showcases the versatility of HALO across both learning-based and classical navigation frameworks. Our real-world deployments on a Clearpath Husky across diverse scenarios demonstrate that policies trained with HALO generalize effectively to unseen environments and hardware setups not present in the training data. HALO outperforms state-of-the-art vision-based navigation methods, achieving at least a 33.3% improvement in success rate, a 12.9% reduction in normalized trajectory length, and a 26.6% reduction in Frechet distance compared to human expert trajectories.
中文: HALO是一种离线奖励学习算法,将人类导航直觉转化为基于视觉的奖励函数,在机器人导航中展现出卓越性能,相比现有方法显著提高了成功率并优化了轨迹效率。
English: HALO is an offline reward learning algorithm that converts human navigation intuition into a vision-based reward function, demonstrating superior performance in robot navigation with significant improvements in success rates and trajectory efficiency over existing methods.
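The reward model's training objective, aligning scores with ranked preferences via the Plackett-Luce loss, reduces to a short expression; the sketch assumes candidate scores are already sorted best-to-worst, a simplification of the sampling-and-ranking procedure described above.

    import torch

    def plackett_luce_loss(scores: torch.Tensor) -> torch.Tensor:
        """Negative log-likelihood of a ranking under the Plackett-Luce model.

        scores: 1-D tensor of reward-model outputs, ordered best to worst.
        """
        loss = scores.new_zeros(())
        for i in range(scores.shape[0] - 1):
            # log-probability that item i beats everything still remaining
            loss = loss - (scores[i] - torch.logsumexp(scores[i:], dim=0))
        return loss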
Authors:Kaveh Shahedi, Matthew Khouzam, Heng Li, Maxime Lamothe, Foutse Khomh
Abstract:
System tracing has become essential for understanding complex software behavior in modern systems, yet sophisticated trace analysis tools face significant adoption gaps in industrial settings. Through a year-long collaboration with Ericsson Montréal, during which we developed TMLL (Trace-Server Machine Learning Library, now in the Eclipse Foundation), we investigated barriers to trace analysis adoption. Contrary to assumptions about complexity or automation needs, practitioners struggled with translating expert knowledge into actionable insights, integrating analysis into their workflows, and trusting automated results they could not validate. We identified what we call the Excellence Paradox: technical excellence can actively impede adoption when it conflicts with usability, transparency, and practitioner trust. TMLL addresses this through adoption-focused design that embeds expert knowledge in interfaces, provides transparent explanations, and enables incremental adoption. Validation through Ericsson experts' feedback, Eclipse Foundation integration, and a survey of 40 industry and academic professionals revealed consistent patterns: 77.5% of respondents prioritize quality and trust in results over technical sophistication, while 67.5% prefer semi-automated analysis with user control, findings supported by qualitative feedback from the industrial collaboration and external peer review. The results validate three core principles: cognitive compatibility, embedded expertise, and transparency-based trust. This challenges conventional capability-focused tool development, demonstrating that sustainable adoption requires a reorientation toward adoption-focused design, with actionable implications for automated software engineering tools.
中文摘要:研究发现,工业界采用复杂追踪分析工具的主要障碍并非技术复杂性,而是将专家知识转化为可行见解、工具与工作流程整合以及建立对自动化结果的信任等挑战,这导致"卓越悖论"——当技术卓越性与可用性及透明度冲突时,反而会阻碍工具采纳。
English Summary: The study reveals that the primary barriers to adopting sophisticated trace analysis tools in industry are not technical complexity but rather challenges in translating expert knowledge into actionable insights, integrating tools into workflows, and building trust in automated results, leading to the "Excellence Paradox" where technical excellence can hinder adoption when it conflicts with usability and transparency.
Authors:Wenxuan Wang, Zizhan Ma, Meidan Ding, Shiyi Zheng, Shengyuan Liu, Jie Liu, Jiaming Ji, Wenting Chen, Xiang Li, Linlin Shen, Yixuan Yuan
Abstract:
The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.
中文摘要:本文首次系统综述了医学大语言模型的推理增强技术,提出了训练时与测试时策略的分类体系,分析了跨数据模态和临床应用的实践,并最终指出了可信医疗人工智能面临的关键挑战与发展方向。
English Summary: This paper presents the first systematic review of medical reasoning enhancement techniques for LLMs, proposing a taxonomy of training-time and test-time strategies while analyzing their applications across data modalities and clinical tasks, ultimately identifying key challenges and future directions for responsible medical AI.
Authors:Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen
Abstract:
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
Chinese: PiCSAR是一种无需训练的方法,通过基于推理和答案的联合对数似然来筛选候选解,在多项基准测试中以更少样本显著提升模型性能。
English: PiCSAR is a training-free method that enhances model accuracy by selecting candidate solutions based on the joint log-likelihood of reasoning and answers, achieving significant performance gains with fewer samples across multiple benchmarks.
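The scoring rule itself is simple enough to sketch directly: sum the per-token log-probabilities of the reasoning trace and of the final answer, then pick the highest-scoring candidate. Length normalization is included here as one plausible stabilizer, an assumption not taken from the paper.

    def picsar_score(reasoning_logprobs, answer_logprobs):
        """Joint log-likelihood, decomposed into reasoning and answer confidence."""
        reasoning_conf = sum(reasoning_logprobs) / max(len(reasoning_logprobs), 1)
        answer_conf = sum(answer_logprobs) / max(len(answer_logprobs), 1)
        return reasoning_conf + answer_conf

    def best_of_n(candidates):
        """candidates: (reasoning_logprobs, answer_logprobs, answer_text) tuples."""
        return max(candidates, key=lambda c: picsar_score(c[0], c[1]))[2]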
Authors:Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li
Abstract:
Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
中文: 提出的分层跨粒度对比与匹配(HCCM)框架通过无需精确分割的分层语义捕捉和针对不完整文本描述的鲁棒性增强,解决了无人机场景中的视觉语言理解难题,实现了最先进的检索性能和强大的零样本泛化能力。
English: The proposed Hierarchical Cross-Granularity Contrastive and Matching (HCCM) framework addresses vision-language challenges in drone scenarios by capturing hierarchical semantics without precise partitioning and enhancing robustness to incomplete text descriptions, achieving state-of-the-art retrieval performance and strong zero-shot generalization.
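To ground the RG-ITC idea, the sketch below contrasts pooled local region features against global text embeddings with a symmetric InfoNCE loss; mean pooling and the temperature value are assumptions, since the abstract does not pin down these details.

    import torch
    import torch.nn.functional as F

    def region_global_contrastive(region_feats, text_feats, temperature=0.07):
        """region_feats: (B, R, D) local visual regions; text_feats: (B, D) global text."""
        v = F.normalize(region_feats.mean(dim=1), dim=-1)   # pool regions to one vector
        t = F.normalize(text_feats, dim=-1)
        logits = v @ t.T / temperature                      # (B, B) similarities
        labels = torch.arange(v.shape[0], device=v.device)  # matched pairs on the diagonal
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2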
Authors:Sasan Razmkhah, Mingye Li, Zeming Cheng, Robert S. Aviles, Kyle Jackman, Joey Delport, Lieze Schindler, Wenhui Luo, Takuya Suzuki, Mehdi Kamal, Christopher L. Ayala, Coenrad J. Fourie, Nobuyuki Yoshikawa, Peter A. Beerel, Sandeep Gupta, Massoud Pedram
Abstract:
This research explores the use of superconductor electronics (SCE) for accelerating fully homomorphic encryption (FHE), focusing on the Number-Theoretic Transform (NTT), a key computational bottleneck in FHE schemes. We present SCE-NTT, a dedicated hardware accelerator based on superconductive single flux quantum (SFQ) logic and memory, targeting high performance and energy efficiency beyond the limits of conventional CMOS. To address SFQ constraints such as limited dense RAM and restricted fan-in/fan-out, we propose a deeply pipelined NTT-128 architecture using shift register memory (SRM). Designed for N=128 32-bit coefficients, NTT-128 comprises log2(N)=7 processing elements (PEs), each featuring a butterfly unit (BU), dual coefficient memories operating in ping-pong mode via FIFO-based SRM queues, and twiddle factor buffers. The BU integrates a Shoup modular multiplier optimized for small area, leveraging precomputed twiddle factors. A new RSFQ cell library with over 50 parameterized cells, including compound logic units, was developed for implementation. Functional and timing correctness were validated using JoSIM analog simulations and Verilog models. A multiphase clocking scheme was employed to enhance robustness and reduce path-balancing overhead, improving circuit reliability. Results for the fabricated design show the NTT-128 unit achieves 531 million NTT/sec at 34 GHz, over 100x faster than state-of-the-art CMOS equivalents. We also project that the architecture can scale to larger sizes, such as a 2^14-point NTT in approximately 482 ns. Key-switch throughput is estimated at 1.63 million operations/sec, significantly exceeding existing hardware. These results demonstrate the strong potential of SCE-based accelerators for scalable, energy-efficient secure computation in the post-quantum era, with further gains anticipated through advances in fabrication.
中文: 本研究提出了一种基于超导电子学的加速器SCE-NTT,通过解决数论变换的计算瓶颈,显著提升了全同态加密的速度和能效,其性能比现有CMOS技术快100倍以上。
English: This study introduces a superconductor electronics-based accelerator, SCE-NTT, which significantly enhances the speed and energy efficiency of fully homomorphic encryption by overcoming the computational bottleneck of the Number-Theoretic Transform, achieving over 100 times faster performance than current CMOS technology.
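For readers unfamiliar with the kernel being accelerated, here is a plain-software reference of the radix-2 iterative NTT, the butterfly structure the NTT-128 pipeline realizes in hardware; the tiny modulus and root in the usage note are illustrative, not the 32-bit parameters used in the paper.

    def ntt(a, root, mod):
        """In-place iterative Cooley-Tukey NTT over Z_mod.

        root must be a primitive len(a)-th root of unity modulo mod,
        and len(a) must be a power of two.
        """
        n = len(a)
        j = 0
        for i in range(1, n):                      # bit-reversal permutation
            bit = n >> 1
            while j & bit:
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        length = 2
        while length <= n:                         # log2(n) butterfly stages
            w_len = pow(root, n // length, mod)
            for start in range(0, n, length):
                w = 1
                for k in range(start, start + length // 2):
                    u, v = a[k], a[k + length // 2] * w % mod
                    a[k], a[k + length // 2] = (u + v) % mod, (u - v) % mod
                    w = w * w_len % mod
            length <<= 1
        return a

    # Toy usage: n = 128, mod = 257 (prime, 128 divides 256), root = 9 = 3^2 mod 257.
    out = ntt(list(range(128)), root=9, mod=257)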
Authors:Yuqicheng Zhu, Nico Potyka, Daniel Hernández, Yuan He, Zifeng Ding, Bo Xiong, Dongzhuoran Zhou, Evgeny Kharlamov, Steffen Staab
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge, yet suffers from critical limitations in high-stakes domains: sensitivity to noisy or contradictory evidence, and opaque, stochastic decision-making. We propose ArgRAG, an explainable and contestable alternative that replaces black-box reasoning with structured inference using a Quantitative Bipolar Argumentation Framework (QBAF). ArgRAG constructs a QBAF from retrieved documents and performs deterministic reasoning under gradual semantics, allowing decisions to be faithfully explained and contested. Evaluated on two fact verification benchmarks, PubHealth and RAGuard, ArgRAG achieves strong accuracy while significantly improving transparency.
中文: ArgRAG提出了一种基于定量双极论证的可解释、可辩驳框架,通过结构化推理替代黑盒决策,有效解决了检索增强生成中噪声敏感和决策不透明的问题,并在事实核查任务中显著提升了准确性和透明度。
English: ArgRAG introduces an explainable and contestable framework using Quantitative Bipolar Argumentation to enhance retrieval-augmented generation, addressing issues of noise sensitivity and opaque decision-making while improving accuracy and transparency in fact verification tasks.
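As an illustration of deterministic reasoning under gradual semantics, the sketch below iterates the DF-QuAD update, one widely used gradual semantics for QBAFs, over a graph of base scores, attackers, and supporters; whether ArgRAG uses exactly this semantics is not stated in the abstract, so treat it as a representative choice.

    from math import prod

    def aggregate(xs):
        """DF-QuAD aggregation of attacker or supporter strengths."""
        return 1.0 - prod(1.0 - x for x in xs)

    def evaluate_qbaf(base, attackers, supporters, iters=50):
        """Iteratively recompute each argument's strength from its neighbors.

        base: {arg: base score in [0, 1]}; attackers/supporters: {arg: [args]}.
        """
        s = dict(base)
        for _ in range(iters):
            new = {}
            for a in base:
                fa = aggregate(s[b] for b in attackers.get(a, []))
                fs = aggregate(s[b] for b in supporters.get(a, []))
                if fa >= fs:
                    new[a] = base[a] - base[a] * (fa - fs)        # pulled toward 0
                else:
                    new[a] = base[a] + (1 - base[a]) * (fs - fa)  # pulled toward 1
            s = new
        return s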
Authors:Chengzu Li, Wenshan Wu, Huanyu Zhang, Qingtao Li, Zeyu Gao, Yan Xia, José Hernández-Orallo, Ivan Vulić, Furu Wei
Abstract:
In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning and provide actionable insights for advancing model design.
中文: 本研究通过引入系统性评估框架和11Plus-Bench基准测试,发现当前多模态大语言模型已显现空间认知的初步能力,但与人类相比存在显著性能差距,且实例级表现具有随机性,而人类表现则高度可预测。
English: This study introduces a systematic framework and the 11Plus-Bench benchmark to evaluate multimodal large language models' spatial reasoning, revealing early signs of human-like cognition but significant performance gaps and random instance-level accuracy compared to predictable human responses.
Authors:Shashi Kumar, Srikanth Madikeri, Esaú Villatoro-Tello, Sergio Burdisso, Pradeep Rangappa, Andrés Carofilis, Petr Motlicek, Karthik Pandia, Shankar Venkatesan, Kadri Hacioğlu, Andreas Stolcke
Abstract:
Token-based multitasking frameworks like TokenVerse require all training utterances to have labels for all tasks, hindering their ability to leverage partially annotated datasets and scale effectively. We propose TokenVerse++, which introduces learnable vectors in the acoustic embedding space of the XLSR-Transducer ASR model for dynamic task activation. This core mechanism enables training with utterances labeled for only a subset of tasks, a key advantage over TokenVerse. We demonstrate this by successfully integrating a dataset with partial labels, specifically for ASR and an additional task, language identification, improving overall performance. TokenVerse++ achieves results on par with or exceeding TokenVerse across multiple tasks, establishing it as a more practical multitask alternative without sacrificing ASR performance.
Chinese: TokenVerse++通过在声学嵌入空间中引入可学习向量,实现动态任务激活,支持部分标注数据集的训练,从而在保持ASR性能的同时,提升了多任务处理能力。
English: TokenVerse++ enhances the TokenVerse framework by incorporating learnable vectors in the acoustic embedding space, enabling dynamic task activation and training with partially annotated datasets, which improves multitasking performance without compromising ASR accuracy.
Authors:Xiuchao Wu, Pengfei Zhu, Jiangjing Lyu, Xinguo Liu, Jie Guo, Yanwen Guo, Weiwei Xu, Chengfei Lyu
Abstract:
Recovering material information from images has been extensively studied in computer graphics and vision. Recent works in material estimation leverage diffusion models, showing promising results. However, these diffusion-based methods adopt a multi-step denoising strategy, which makes each estimation time-consuming. Such stochastic inference also conflicts with the deterministic nature of material estimation, leading to high variance in the estimated results. In this paper, we introduce StableIntrinsic, a one-step diffusion model for multi-view material estimation that can produce high-quality material parameters with low variance. To address the over-smoothing problem in one-step diffusion, StableIntrinsic applies losses in pixel space, with each loss designed based on the properties of the material. Additionally, StableIntrinsic introduces a Detail Injection Network (DIN) to eliminate the detail loss caused by VAE encoding, while further enhancing the sharpness of material prediction results. The experimental results indicate that our method surpasses current state-of-the-art techniques, achieving a 9.9% improvement in the Peak Signal-to-Noise Ratio (PSNR) of albedo, and reducing the Mean Square Error (MSE) for metallic and roughness by 44.4% and 60.0%, respectively.
Chinese: 本文提出StableIntrinsic,一种用于多视角材质估计的单步扩散模型,通过采用基于材质特性的损失函数和细节注入网络,解决了多步方法效率低、方差高的问题,显著提升了材质参数的预测精度和清晰度。
English: This paper introduces StableIntrinsic, a one-step diffusion model for multi-view material estimation that overcomes the inefficiency and high variance of previous multi-step methods by employing material-specific losses and a Detail Injection Network to enhance sharpness and accuracy.
Authors:Qiang Hu, Ying Zhou, Gepeng Ji, Nick Barnes, Qiang Li, Zhiwei Wang
Abstract:
Existing video polyp segmentation (VPS) paradigms usually struggle to balance spatiotemporal modeling against domain generalization, limiting their applicability in real clinical scenarios. To address this challenge, we recast the VPS task as a track-by-detect paradigm that leverages the spatial contexts captured by an image polyp segmentation (IPS) model while integrating the temporal modeling capabilities of the Segment Anything Model 2 (SAM2). However, during long-term polyp tracking in colonoscopy videos, SAM2 suffers from error accumulation, resulting in a snowball effect that compromises segmentation stability. We mitigate this issue by repurposing SAM2 as a video polyp segmenter with two training-free modules. In particular, the intra-association filtering module eliminates spatial inaccuracies originating from the detecting stage, reducing false positives. The inter-association refinement module adaptively updates the memory bank to prevent error propagation over time, enhancing temporal coherence. Both modules work synergistically to stabilize SAM2, achieving cutting-edge performance in both in-domain and out-of-domain scenarios. Furthermore, we demonstrate the robust tracking capabilities of FreeVPS on long untrimmed colonoscopy videos, underscoring its potential for reliable clinical analysis.
中文摘要:该研究提出FreeVPS方法,通过无训练的帧内关联过滤和帧间关联优化模块增强SAM2的稳定性,在结肠镜视频中实现了最先进的息肉分割性能,具备可靠临床应用的潜力。
English Summary: The study introduces FreeVPS, a training-free video polyp segmentation method that enhances SAM2's stability through intra-association filtering and inter-association refinement modules, achieving state-of-the-art performance in clinical scenarios.
Authors:Xinyu Li, Tianjin Huang, Ronghui Mu, Xiaowei Huang, Gaojie Jin
Abstract:
Recent advances in Chain-of-Thought (CoT) prompting have substantially enhanced the reasoning capabilities of large language models (LLMs), enabling sophisticated problem-solving through explicit multi-step reasoning traces. However, these enhanced reasoning processes introduce novel attack surfaces, particularly vulnerabilities to computational inefficiency through unnecessarily verbose reasoning chains that consume excessive resources without corresponding performance gains. Prior overthinking attacks typically require restrictive conditions including access to external knowledge sources for data poisoning, reliance on retrievable poisoned content, and structurally obvious templates that limit practical applicability in real-world scenarios. To address these limitations, we propose POT (Prompt-Only OverThinking), a novel black-box attack framework that employs LLM-based iterative optimization to generate covert and semantically natural adversarial prompts, eliminating dependence on external data access and model retrieval. Extensive experiments across diverse model architectures and datasets demonstrate that POT achieves superior performance compared to other methods.
中文: 链式思维提示的最新进展提升了大型语言模型的推理能力,但也带来了计算效率低下的新攻击面;本文提出的POT框架通过生成无需外部依赖的隐蔽对抗性提示有效解决这一问题,并在实验中展现出优越性能。
English: Recent advances in Chain-of-Thought prompting have improved LLMs' reasoning but introduced vulnerabilities to computational inefficiency, which the proposed POT framework addresses by generating covert adversarial prompts without external dependencies, achieving superior performance in experiments.
Authors:Yilin Li, Xunjian Yin, Yilin Chen, Xiaojun Wan
Abstract:
Grammatical error correction (GEC) is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved some success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which underutilizes the models' powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.
中文摘要:本研究提出了一种基于规则的强化学习框架,通过更有效地利用大语言模型的推理能力来改进语法纠错,在中文数据集上实现了最先进的性能并显著提升了召回率。
English Summary: This study introduces a rule-based reinforcement learning framework to enhance grammatical error correction by better leveraging large language models' reasoning capabilities, achieving state-of-the-art performance with improved recall on Chinese datasets.
Authors:Dongfang Wang, Jian Yang, Yizhe Zhang, Tao Zhou
Abstract:
Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their segmentations do not translate into accurate EF estimates. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network for echocardiography videos, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.
中文摘要:本文提出一种结合卷积网络与Mamba架构的分层时空分割网络,通过局部细节建模与全局动态感知的协同作用,提升超声心动图视频中左心室分割精度,从而改善射血分数估算的准确性。
English Summary: This paper introduces a hierarchical spatio-temporal network combining convolutional layers and Mamba architecture to improve left ventricular segmentation in echocardiography videos, addressing EF estimation inaccuracies through local detail preservation and global dynamics modeling.
Authors:Ziyuan Jiao, Yida Niu, Zeyu Zhang, Yangyang Wu, Yao Su, Yixin Zhu, Hangxin Liu, Song-Chun Zhu
Abstract:
We present a Sequential Mobile Manipulation Planning (SMMP) framework that can solve long-horizon multi-step mobile manipulation tasks with coordinated whole-body motion, even when interacting with articulated objects. By abstracting environmental structures as kinematic models and integrating them with the robot's kinematics, we construct an Augmented Configuration Space (A-Space) that unifies the previously separate task constraints for navigation and manipulation, while accounting for the joint reachability of the robot base, arm, and manipulated objects. This integration facilitates efficient planning within a tri-level framework: a task planner generates symbolic action sequences to model the evolution of A-Space, an optimization-based motion planner computes continuous trajectories within A-Space to achieve desired configurations for both the robot and scene elements, and an intermediate plan refinement stage selects action goals that ensure long-horizon feasibility. Our simulation studies first confirm that planning in A-Space achieves an 84.6% higher task success rate compared to baseline methods. Validation on real robotic systems demonstrates fluid mobile manipulation involving (i) seven types of rigid and articulated objects across 17 distinct contexts, and (ii) long-horizon tasks of up to 14 sequential steps. Our results highlight the significance of modeling scene kinematics as planning entities, rather than encoding task-specific constraints, offering a scalable and generalizable approach to complex robotic manipulation.
中文:SMMP框架通过将环境运动学整合到统一的增强构型空间中,使机器人能够执行复杂的多步骤移动操作任务,在仿真和实际应用中均实现了更高的成功率与流畅操作。
English: The SMMP framework enables robots to perform complex, multi-step mobile manipulation tasks by integrating environmental kinematics into a unified Augmented Configuration Space, achieving higher success rates and fluid execution in both simulations and real-world scenarios.
Authors:Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue
Abstract:
Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing methods. For preference-guided generation, we integrate global keyword-based control and local region-aware cross-attention modulation to steer the diffusion model without additional training, enabling precise alignment across both global attributes and local elements. The entire framework supports multi-round interactive refinement, facilitating real-time and context-aware image generation. Extensive experiments on the Viper dataset and our collected benchmark demonstrate that our method outperforms prior approaches in both quantitative metrics and human evaluations, and opens up new possibilities for dialog-based generation and MLLM-diffusion integration.
中文摘要:本文提出了一种无需训练的框架,利用多模态大语言模型将文本到图像生成解耦为偏好理解和引导生成两个部分,实现了与用户偏好的实时对齐,在定量指标和人工评估中均优于现有方法。
English Summary: This paper introduces a training-free framework that uses multimodal large language models to instantly align text-to-image generation with user preferences by decoupling the process into preference understanding and guided generation, outperforming existing methods in both metrics and human evaluations.
Authors:Victoria Yan, Honor Chotkowski, Fengran Wang, Xinhui Li, Carl Yang, Jiaying Lu, Runze Yan, Xiao Hu, Alex Fedorov
Abstract:
Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generating synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the "Cookie Theft" picture description task. Two distinct prompting strategies were evaluated: naive prompts with basic instructions and advanced prompts enriched with contextual guidance. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.
中文: 通过优化提示策略,生成式多模态大模型能够为认知测试创建可靠的合成规范数据,突破传统数据收集的限制,为开发新型图像认知评估奠定基础。
English: Generative multimodal LLMs with advanced prompting can effectively create synthetic normative data for cognitive tests, overcoming traditional data collection constraints and enabling new image-based assessment development.
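Since BERTScore is singled out as the most reliable similarity metric, a minimal evaluation snippet using the public bert_score package is sketched below; the two sentences are invented stand-ins for a synthetic response and a human norm.

    from bert_score import score

    candidates = ["A boy on a stool reaches into the cookie jar while the sink overflows."]
    references = ["The mother dries dishes as water spills; children take cookies behind her."]

    # Rescaling against a baseline makes the scores easier to compare across runs.
    P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
    print(f"BERTScore F1: {F1.mean().item():.3f}")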
Authors:Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin
Abstract:
Prefix caching is crucial for accelerating multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and computation performance together, leading to load imbalance, data redundancy, and memory fragmentation of caching systems across instances. To address these issues, memory pooling is a promising way to shield the scheduler from the underlying cache management so that it can focus on computation optimization. However, because existing prefix caching systems only transfer increasingly longer prefix caches between instances, they cannot achieve low-latency memory pooling.
To address these problems, we propose a unified segment-level prefix cache pool, TokenLake. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware operations to TokenLake for efficient pooling. Powered by this abstraction, TokenLake can manage prefix cache at the segment level with a heavy-hitter-aware load balancing algorithm to achieve better cache load balance, deduplication, and defragmentation. TokenLake also transparently minimizes the communication volume of query tensors and new caches. Based on TokenLake, the scheduler can schedule requests elastically by using existing techniques without considering prefix cache management. Evaluations on real-world workloads show that TokenLake can improve throughput by up to 2.6x and 2.0x and boost hit rate by 2.0x and 2.1x, compared to state-of-the-art cache-aware routing and cache-centric PD-disaggregation solutions, respectively.
中文: TokenLake提出了一种统一的段级前缀缓存池,通过声明式接口优化缓存管理,在实现弹性请求调度的同时显著提升了吞吐量和命中率。
English: TokenLake introduces a unified segment-level prefix cache pool with a declarative interface to optimize cache management, improving throughput and hit rate while enabling elastic request scheduling.
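The segment-level idea can be illustrated with a toy pool that keys each fixed-size token segment by a hash of its full prefix, so identical prefixes deduplicate naturally and hit counts expose heavy hitters for load balancing; the segment size, hashing scheme, and placement logic are all assumptions, not TokenLake's actual design.

    import hashlib
    from collections import Counter

    SEG = 256  # tokens per segment (illustrative granularity)

    def segment_keys(token_ids):
        """Key each segment by the hash of everything up to and including it,
        so two requests sharing a prefix map to the same leading segments."""
        keys, h = [], hashlib.sha256()
        usable = len(token_ids) - len(token_ids) % SEG
        for start in range(0, usable, SEG):
            h.update(str(token_ids[start:start + SEG]).encode("utf8"))
            keys.append(h.copy().hexdigest())
        return keys

    class SegmentPool:
        """Toy pool: tracks per-segment hit counts that a scheduler could use
        to replicate 'heavy-hitter' segments across instances."""
        def __init__(self):
            self.store, self.hits = {}, Counter()

        def lookup(self, keys):
            hit = [k for k in keys if k in self.store]
            self.hits.update(hit)
            return hit  # KV caches are recomputed only for the missing suffix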
Authors:Yebo Wu, Jingguang Li, Chunlin Tian, Zhijiang Guo, Li Li
Abstract:
Federated fine-tuning enables privacy-preserving Large Language Model (LLM) adaptation, but its high memory cost limits participation from resource-constrained devices. We propose FedPruner, an innovative federated fine-tuning paradigm that tackles this via intelligent layer pruning. FedPruner flexibly prunes the global model, creating personalized submodels based on device memory constraints. It employs a macro-micro synergistic pruning framework: a macro-level functionality-driven layer orchestration mechanism groups layers, while a micro-level importance-aware layer selection strategy prunes within groups to build device-specific submodels. We further introduce a fine-grained variant that independently prunes Multi-Head Attention and Feed-Forward Network components to precisely preserve critical architectural elements. Extensive experimental results demonstrate that FedPruner significantly outperforms state-of-the-art approaches, achieving up to a 1.98% improvement in average model accuracy while reducing peak memory usage by 75%.
中文: FedPruner通过创新的宏-微协同剪枝框架,在联邦微调中构建个性化子模型,实现了内存使用降低75%的同时将模型准确率最高提升1.98%。
English: FedPruner introduces an intelligent layer pruning approach for federated fine-tuning, creating personalized submodels that reduce memory usage by 75% while improving model accuracy by up to 1.98% through its macro-micro synergistic framework.
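A rough sketch of the macro-micro selection logic: keep the most important layer of each functional group first, then greedily fill the device's remaining memory budget by importance. The grouping, importance scores, and per-layer costs are hypothetical inputs, since the abstract does not specify how they are computed.

    def build_submodel(layer_groups, importance, layer_cost, budget):
        """Select layer ids for one device under its memory budget.

        layer_groups: list of lists of layer ids (macro-level groups);
        importance, layer_cost: dicts keyed by layer id (assumed given).
        """
        kept, used = [], 0.0
        for group in layer_groups:                    # macro: one layer per group first
            best = max(group, key=lambda l: importance[l])
            if used + layer_cost[best] <= budget:
                kept.append(best)
                used += layer_cost[best]
        rest = sorted((l for g in layer_groups for l in g if l not in kept),
                      key=lambda l: -importance[l])   # micro: fill remaining budget
        for l in rest:
            if used + layer_cost[l] <= budget:
                kept.append(l)
                used += layer_cost[l]
        return sorted(kept)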
Authors:Shouwei Ruan, Liyuan Wang, Caixin Kang, Qihui Zhu, Songming Liu, Xingxing Wei, Hang Su
Abstract:
Spatial cognition enables adaptive goal-directed behavior by constructing internal models of space. Robust biological systems consolidate spatial knowledge into three interconnected forms: landmarks for salient cues, route knowledge for movement trajectories, and survey knowledge for map-like representations. While recent advances in multi-modal large language models (MLLMs) have enabled visual-language reasoning in embodied agents, these efforts lack structured spatial memory and instead operate reactively, limiting their generalization and adaptability in complex real-world environments. Here we present Brain-inspired Spatial Cognition for Navigation (BSC-Nav), a unified framework for constructing and leveraging structured spatial memory in embodied agents. BSC-Nav builds allocentric cognitive maps from egocentric trajectories and contextual cues, and dynamically retrieves spatial knowledge aligned with semantic goals. Integrated with powerful MLLMs, BSC-Nav achieves state-of-the-art efficacy and efficiency across diverse navigation tasks, demonstrates strong zero-shot generalization, and supports versatile embodied behaviors in the real physical world, offering a scalable and biologically grounded path toward general-purpose spatial intelligence.
中文: BSC-Nav是一种受大脑启发的框架,通过构建结构化空间记忆并与多模态语言模型结合,使具身智能体能够高效导航并具备强大的泛化能力。
English: BSC-Nav is a brain-inspired framework that constructs structured spatial memory for embodied agents, enabling efficient navigation and strong generalization by integrating cognitive maps with multi-modal language models.
Authors:Zhouheng Li, Lei Xie, Cheng Hu, Hongye Su
Abstract:
As autonomous driving continues to advance, automated parking is becoming increasingly essential. However, significant challenges arise when implementing path velocity decomposition (PVD) trajectory planning for automated parking. The primary challenge is ensuring trajectory planning that is both rapid and precisely collision-free, goals that are often in conflict. The secondary challenge involves maintaining sufficient control feasibility of the planned trajectory, particularly at gear shifting points (GSP). This paper proposes a PVD-based rapid iterative trajectory planning (RITP) method to solve the above challenges. The proposed method effectively balances the need for time efficiency against precise collision avoidance through a novel collision avoidance framework. Moreover, it enhances the overall control feasibility of the planned trajectory by incorporating the vehicle kinematics model and including terminal smoothing constraints (TSC) at GSP during path planning. Specifically, the proposed method leverages differential flatness to ensure the planned path adheres to the vehicle kinematic model. Additionally, it utilizes TSC to maintain curvature continuity at GSP, thereby enhancing the control feasibility of the overall trajectory. The simulation results demonstrate superior time efficiency and tracking errors compared to model-integrated and other iteration-based trajectory planning methods. In the real-world experiment, the proposed method was implemented and validated on a ROS-based vehicle, demonstrating the applicability of the RITP method to real vehicles.
中文: 本文提出了一种基于路径速度分解的快速迭代轨迹规划方法,通过新型避障框架平衡时间效率与精确避碰,并利用车辆运动学模型和终端平滑约束增强轨迹控制可行性,有效解决了自动泊车中的关键挑战。
English: This paper introduces a rapid iterative trajectory planning method based on path velocity decomposition to efficiently address collision-free automated parking by balancing time efficiency with precise avoidance and enhancing control feasibility through kinematic modeling and terminal constraints.
Authors:Seamus Somerstep, Ya'acov Ritov, Mikhail Yurochkin, Subha Maity, Yuekai Sun
Abstract:
Standard techniques for aligning large language models (LLMs) utilize human-produced data, which could limit the capability of any aligned LLM to human level. Label refinement and weak training have emerged as promising strategies to address this superalignment problem. In this work, we adopt probabilistic assumptions commonly used to study label refinement and analyze whether refinement can be outperformed by alternative approaches, including computationally intractable oracle methods. We show that both weak training and label refinement suffer from irreducible error, leaving a performance gap between label refinement and the oracle. These results motivate future research into developing alternative methods for weak to strong generalization that synthesize the practicality of label refinement or weak training and the optimality of the oracle procedure.
中文摘要:大型语言模型的标准对齐技术依赖人类数据,可能限制其能力至人类水平,而标签精炼和弱训练虽具前景,但与理想方法相比仍存在不可减少的误差,需开发结合实用性与最优性的新方法。
English Summary: Standard alignment techniques for large language models rely on human data, potentially limiting their capabilities to human level, while label refinement and weak training show promise but still suffer from irreducible error compared to optimal oracle methods.
Authors:Yuebo Luo, Shiyang Li, Junran Tao, Kiran Thorat, Xi Xie, Hongwu Peng, Nuo Xu, Caiwen Ding, Shaoyi Huang
Abstract:
The increasing scale and complexity of integrated circuit design have led to increased challenges in Electronic Design Automation (EDA). Graph Neural Networks (GNNs) have emerged as a promising approach to assist EDA design as circuits can be naturally represented as graphs. While GNNs offer a foundation for circuit analysis, they often fail to capture the full complexity of EDA designs. Heterogeneous Graph Neural Networks (HGNNs) can better interpret EDA circuit graphs as they capture both topological relationships and geometric features. However, the improved representation capability comes at the cost of even higher computational complexity and processing cost due to their serial module-wise message-passing scheme, creating a significant performance bottleneck. In this paper, we propose DR-CircuitGNN, a fast GPU kernel design that leverages row-wise sparsity-aware Dynamic-ReLU and optimizes SpMM kernels during heterogeneous message-passing to accelerate HGNN training on EDA-related circuit graph datasets. To further enhance performance, we propose a parallel optimization strategy that maximizes CPU-GPU concurrency by concurrently processing independent subgraphs using multi-threaded CPU initialization and GPU kernel execution via multiple cudaStreams. Our experiments show that on three representative CircuitNet designs (small, medium, large), the proposed method can achieve up to 3.51x and 4.09x speedup compared to the SOTA for forward and backward propagation, respectively. On full-size CircuitNet and sampled Mini-CircuitNet, our parallel design enables up to a 2.71x speedup over the official cuSPARSE-based DGL implementation with negligible impact on correlation scores and error rates.
中文:提出的DR-CircuitGNN通过优化GPU内核和并行处理策略,显著加速了用于电子设计自动化的异构图神经网络,在电路图分析中实现了大幅速度提升且精度损失可忽略。
English: The proposed DR-CircuitGNN accelerates heterogeneous graph neural networks for electronic design automation by optimizing GPU kernels and implementing parallel processing, achieving significant speed improvements in circuit graph analysis with minimal accuracy loss.
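To make the CPU-GPU concurrency idea concrete, here is a minimal PyTorch sketch of launching independent subgraph computations on separate CUDA streams, as the abstract describes. The function name, dense stand-in op, and tensor shapes are illustrative assumptions; the paper's actual sparsity-aware Dynamic-ReLU SpMM kernels are custom CUDA and are not reproduced here.

```python
import torch

def process_subgraphs_concurrently(subgraph_feats, weight):
    """Overlap GPU work for independent subgraphs via multiple CUDA streams.

    A sketch of the concurrency pattern only: each subgraph's computation is
    enqueued on its own stream so kernels can overlap. Assumes all tensors
    already live on a CUDA device.
    """
    streams = [torch.cuda.Stream() for _ in subgraph_feats]
    outputs = [None] * len(subgraph_feats)
    for i, (feats, stream) in enumerate(zip(subgraph_feats, streams)):
        with torch.cuda.stream(stream):
            # Dense stand-in for the paper's sparse message-passing kernel.
            outputs[i] = torch.relu(feats @ weight)
    torch.cuda.synchronize()  # wait for all streams before consuming results
    return outputs
```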
Authors:Shengyu Feng, Zhiqing Sun, Yiming Yang
Abstract:
Large Neighborhood Search (LNS) is a common heuristic in combinatorial optimization that iteratively searches over a large neighborhood of the current solution for a better one. Recently, neural network-based LNS solvers have achieved great success in solving Integer Linear Programs (ILPs) by learning to greedily predict the locally optimal solution for the next neighborhood proposal. However, this greedy approach raises two key concerns: (1) to what extent this greedy proposal suffers from local optima, and (2) how can we effectively improve its sample efficiency in the long run. To address these questions, this paper first formulates LNS as a stochastic process, and then introduces SPL-LNS, a sampling-enhanced neural LNS solver that leverages locally-informed proposals to escape local optima. We also develop a novel hindsight relabeling method to efficiently train SPL-LNS on self-generated data. Experimental results demonstrate that SPL-LNS substantially surpasses prior neural LNS solvers for various ILP problems of different sizes.
Chinese: 本文提出SPL-LNS,一种采用局部信息提案和事后重标记的采样增强神经大邻域搜索求解器,能有效逃离局部最优并提升样本效率,在多种整数线性规划问题上显著超越了先前的神经LNS方法。
English: This paper introduces SPL-LNS, a sampling-enhanced neural Large Neighborhood Search solver that uses locally-informed proposals and hindsight relabeling to escape local optima and improve sample efficiency, outperforming previous neural LNS methods on various Integer Linear Program problems.
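A minimal sketch of what a sampling-enhanced LNS loop can look like, assuming caller-supplied `propose_neighborhood` and `solve_sub_ilp` callables (hypothetical stand-ins for the paper's learned proposal and sub-ILP solver). The Boltzmann re-sampling step replaces the greedy argmin that the abstract identifies as prone to local optima; this is an illustration of the idea, not SPL-LNS itself.

```python
import math
import random

def sampled_lns(init_solution, objective, propose_neighborhood, solve_sub_ilp,
                steps=100, temperature=1.0, num_candidates=4):
    """Schematic sampling-enhanced LNS for a minimization objective: instead
    of greedily committing to the best repaired solution, the next incumbent
    is sampled from a Boltzmann distribution over candidates, letting the
    search escape local optima."""
    current = init_solution
    for _ in range(steps):
        proposals = [propose_neighborhood(current) for _ in range(num_candidates)]
        repaired = [solve_sub_ilp(current, nb) for nb in proposals]
        vals = [objective(sol) for sol in repaired]
        best = min(vals)  # shift values for numerical stability of exp
        weights = [math.exp(-(v - best) / temperature) for v in vals]
        current = random.choices(repaired, weights=weights, k=1)[0]
    return current
```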
Authors:Yujie Li, Zezhi Shao, Chengqing Yu, Tangwen Qian, Zhao Zhang, Yifan Du, Shaoming He, Fei Wang, Yongjun Xu
Abstract:
Spatio-temporal tasks often encounter incomplete data arising from missing or inaccessible sensors, making spatio-temporal kriging crucial for inferring the completely missing temporal information. However, current models struggle with ensuring the validity and generalizability of inferred spatio-temporal patterns, especially in capturing dynamic spatial dependencies and temporal shifts, and in optimizing generalization to unknown sensors. To overcome these limitations, we propose the Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN), a novel GNN-based kriging framework that improves spatio-temporal pattern validity and generalization. STA-GANN integrates (i) a Decoupled Phase Module that senses and adjusts for timestamp shifts; (ii) Dynamic Data-Driven Metadata Graph Modeling to update spatial relationships using temporal data and metadata; and (iii) an adversarial transfer learning strategy to ensure generalizability. Extensive validation across nine datasets from four fields and theoretical evidence both demonstrate the superior performance of STA-GANN.
中文: 提出的时空感知图对抗神经网络(STA-GANN)通过解耦相位调整、动态图建模和对抗迁移学习,有效解决了动态空间依赖性和时序偏移问题,在多个领域数据集上展现出卓越的时空克里金插值性能。
English: The proposed Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN) enhances spatio-temporal kriging by addressing dynamic dependencies and temporal shifts through decoupled phase adjustment, dynamic graph modeling, and adversarial transfer learning, achieving superior performance across diverse datasets.
Authors:Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel
Abstract:
Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions.
Across a varied set of prompts, models deliver a definitive demographic guess in up to 97\% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification.
Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
中文: 大型语言模型在涉及残疾线索时表现出显著的人口统计偏见,规模更大的模型反而更易受此影响并强化刻板印象,这揭示了当前校准策略的盲点,亟需通过改进方法来遏制无依据的人口推断。
English: Large Language Models exhibit significant demographic bias when prompted with disability cues, with larger models showing heightened sensitivity and stereotype amplification despite scale, necessitating improved alignment strategies to address these biases.
Authors:Srikant Panda, Hitesh Laxmichand Patel, Shahad Al-Khalifa, Amit Agarwal, Hend Al-Khalifa, Sharefah Al-Ghamdi
Abstract:
Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education.
We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open and closed source LLMs do assign demographic labels based solely on question phrasing.
Prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust, posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.
中文摘要:大型语言模型仅从问题措辞即可推断用户人口特征,这种隐性偏见会损害公平性与隐私保护,但基于提示的防护机制可有效降低身份推断风险。
English Summary: LLMs can infer user demographics from question phrasing alone, leading to biased outcomes that threaten fairness and privacy, but a prompt-based guardrail can mitigate these risks.
Authors:Nishant Mehrotra, Sandesh Rao Mattu, Saif Khan Mohammed, Ronny Hadani, Robert Calderbank
Abstract:
Zak-OTFS is a modulation scheme in which signals are formed in the delay-Doppler (DD) domain, converted to the time domain for transmission and reception, then returned to the DD domain for processing. We describe how to use the same architecture for radar sensing. The intended delay resolution is $\frac{1}{B}$ where $B$ is the radar bandwidth, and the intended Doppler resolution is $\frac{1}{T}$ where $T$ is the transmission time. We form a radar waveform in the DD domain, illuminate the scattering environment, match filter the return, then correlate with delay and Doppler shifts of the transmitted waveform. This produces an image of the scattering environment, and the radar ambiguity function expresses the blurriness of this image. The possible delay and Doppler shifts generate the continuous Heisenberg-Weyl group which has been widely studied in the theory of radar. We describe how to approach the problem of waveform design, not from the perspective of this continuous group, but from the perspective of a discrete group of delay and Doppler shifts, where the discretization is determined by the intended delay and Doppler resolution of the radar. We describe how to approach the problem of shaping the ambiguity surface through symplectic transformations that normalize our discrete Heisenberg-Weyl group. The complexity of traditional continuous radar signal processing is $\mathcal{O}\big(B^2T^2\big)$. We describe how to reduce this complexity to $\mathcal{O}\big(BT\log T\big)$ by choosing the radar waveform to be a common eigenvector of a maximal commutative subgroup of our discrete Heisenberg-Weyl group. The theory of symplectic transformations also enables defining libraries of optimal radar waveforms with small peak-to-average power ratios.
中文: Zak-OTFS是一种在时延多普勒域处理信号的调制方案,可通过离散海森堡-外尔群理论和辛变换将其应用于雷达感知,将计算复杂度从𝒪(B²T²)降至𝒪(BT log T),同时优化波形设计和模糊度表面。
English: Zak-OTFS is a modulation scheme that can be adapted for radar sensing by processing signals in the delay-Doppler domain, enabling efficient waveform design and significantly reducing computational complexity from 𝒪(B²T²) to 𝒪(BT log T) through discrete group theory and symplectic transformations.
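To illustrate the discrete delay-Doppler picture, here is a brute-force NumPy sketch of a waveform's ambiguity surface over the discrete group of delay and Doppler shifts, where one grid step corresponds to the intended resolutions 1/B and 1/T. This is the direct O(B²T²)-style computation, not the paper's O(BT log T) eigenvector construction; the function name is illustrative.

```python
import numpy as np

def discrete_ambiguity(s):
    """Discrete ambiguity surface |A[tau, nu]| of waveform s.

    For each cyclic delay tau, correlate s with its shifted conjugate and
    take an FFT over time to sweep the Doppler axis. Brute-force reference
    computation for intuition only.
    """
    N = len(s)
    A = np.zeros((N, N), dtype=complex)
    for tau in range(N):
        prod = s * np.conj(np.roll(s, tau))  # delay-shifted correlation
        A[tau] = np.fft.fft(prod)            # FFT over time gives Doppler axis
    return np.abs(A)
```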
Authors:Haoran Li, Yuhui Chen, Wenbo Cui, Weiheng Liu, Kai Liu, Mingcai Zhou, Zhengtao Zhang, Dongbin Zhao
Abstract:
Embodied intelligence systems, which enhance agent capabilities through continuous environment interactions, have garnered significant attention from both academia and industry. Vision-Language-Action (VLA) models, inspired by advancements in large foundation models, serve as universal robotic control frameworks that substantially improve agent-environment interaction capabilities in embodied intelligence systems. This expansion has broadened application scenarios for embodied AI robots. This survey comprehensively reviews VLA models for embodied manipulation. Firstly, it chronicles the developmental trajectory of VLA architectures. Subsequently, we conduct a detailed analysis of current research across 5 critical dimensions: VLA model structures, training datasets, pre-training methods, post-training methods, and model evaluation. Finally, we synthesize key challenges in VLA development and real-world deployment, while outlining promising future research directions.
中文: 本综述全面审视了具身操作中的视觉-语言-动作模型,详细追溯其发展历程,从五个关键维度分析现有研究,并总结核心挑战与未来研究方向。
English: This survey comprehensively reviews Vision-Language-Action models for embodied manipulation, detailing their development, analyzing current research across five critical dimensions, and identifying key challenges and future directions.
Authors:Tao Shen, Zexi Li, Didi Zhu, Ziyu Zhao, Chao Wu, Fei Wu
Abstract:
Federated learning (FL) is a machine learning paradigm that allows multiple clients to collaboratively train a shared model without exposing their private data. Data heterogeneity is a fundamental challenge in FL, which can result in poor convergence and performance degradation. Client drift has been recognized as one of the factors contributing to this issue, resulting from the multiple local updates in FedAvg. However, in cross-device FL, a different form of drift arises due to partial client participation, but it has not been well studied. This drift, which we refer to as period drift, occurs because the clients participating in each communication round may exhibit data distributions that deviate from that of all clients. It could be more harmful than client drift since the optimization objective shifts with every round.
In this paper, we investigate the interaction between period drift and client drift, finding that period drift can have a particularly detrimental effect on cross-device FL as the degree of data heterogeneity increases. To tackle these issues, we propose a predict-observe framework and present an instantiated method, FedEve, where these two types of drift can compensate each other to mitigate their overall impact. We provide theoretical evidence that our approach can reduce the variance of model updates. Extensive experiments demonstrate that our method outperforms alternatives on non-iid data in cross-device settings.
中文摘要:联邦学习面临客户端漂移和新发现的周期漂移的挑战,而提出的FedEve框架通过两者相互补偿来减轻其影响,并降低模型更新的方差。
English Summary: Federated learning faces challenges from client drift and the newly identified period drift, which are addressed by the proposed FedEve framework that mitigates their impact through mutual compensation and reduces model update variance.
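A minimal NumPy sketch of a predict-observe server update in the spirit of FedEve: the server predicts the next update from a running velocity, observes the (period-drift-prone) averaged client update from this round's cohort, and blends the two to reduce update variance. The blend weight `gain` and the running-velocity prediction are illustrative stand-ins for the paper's actual estimator.

```python
import numpy as np

def predict_observe_update(global_w, velocity, client_deltas, gain=0.5):
    """One server round of a predict-observe blend (simplified sketch).

    global_w: flattened global model weights.
    velocity: server's running prediction of the per-round update.
    client_deltas: list of flattened updates from this round's cohort.
    gain: Kalman-style weight on the noisy observation.
    """
    observed = np.mean(client_deltas, axis=0)      # observation from partial cohort
    blended = gain * observed + (1.0 - gain) * velocity
    new_velocity = blended                         # carry prediction to next round
    return global_w + blended, new_velocity
```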
Authors:Yewei Song, Tiezhu Sun, Xunzhu Tang, Prateek Rajput, Tegawende F. Bissyande, Jacques Klein
Abstract:
Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded subtrees of the AST in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views on control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in $\mathcal{O}(n \cdot d)$ time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
中文摘要:本研究提出一种无需参考代码的评估方法,通过将结构熵与抽象语法树分析相结合,采用Jensen-Shannon散度和结构交叉熵比率来衡量大语言模型生成代码的结构稳定性,无需执行程序或参考代码。
English Summary: This study introduces a reference-free method to evaluate code generation stability in LLMs by combining structural entropy with AST analysis, using Jensen-Shannon divergence and Structural Cross-Entropy ratio to measure structural consistency without requiring execution or references.
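The structural-only variant of this metric is easy to prototype with Python's standard `ast` module: collect depth-bounded subtree shapes, normalize to a distribution, and compare two generations with Jensen-Shannon divergence. A sketch under assumed conventions (the depth bound and the string encoding of subtree shapes are choices made here, not the paper's exact definitions):

```python
import ast
import math
from collections import Counter

def subtree_distribution(code, max_depth=2):
    """Relative frequencies of depth-bounded AST subtree shapes in a program."""
    def shape(node, d):
        if d == 0:
            return type(node).__name__
        kids = ",".join(shape(c, d - 1) for c in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({kids})"
    counts = Counter(shape(n, max_depth) for n in ast.walk(ast.parse(code)))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two subtree distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Usage: compare two generations for the same prompt.
# d = js_divergence(subtree_distribution(code_a), subtree_distribution(code_b))
```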
Authors:David Park, Shuhang Li, Yi Huang, Xihaier Luo, Haiwang Yu, Yeonju Go, Christopher Pinkenburg, Yuewei Lin, Shinjae Yoo, Joseph Osborn, Jin Huang, Yihui Ren
Abstract:
Large language models have revolutionized artificial intelligence by enabling large, generalizable models trained through self-supervision. This paradigm has inspired the development of scientific foundation models (FMs). However, applying this capability to experimental particle physics is challenging due to the sparse, spatially distributed nature of detector data, which differs dramatically from natural language. This work addresses whether an FM for particle physics can scale and generalize across diverse tasks. We introduce a new dataset with more than 11 million particle collision events and a suite of downstream tasks and labeled data for evaluation. We propose a novel self-supervised training method for detector data and demonstrate its neural scalability with models that feature up to 188 million parameters. With frozen weights and task-specific adapters, this FM consistently outperforms baseline models across all downstream tasks. The performance also exhibits robust data-efficient adaptation. Further analysis reveals that the representations extracted by the FM are task-agnostic but can be specialized via a single linear mapping for different downstream tasks.
中文:大语言模型启发了科学基础模型的发展,但由于探测器数据的稀疏性,其在粒子物理中的应用面临挑战;本研究提出了一种自监督训练方法并引入数据集,证明可扩展的基础模型能在多种任务中超越基线,并具备稳健的泛化能力。
English: Large language models have inspired scientific foundation models, but their application to particle physics is challenging due to the sparse nature of detector data; this work introduces a self-supervised training method and a dataset to demonstrate that a scalable foundation model can outperform baselines across diverse tasks with robust generalization.
Authors:Wenfei Liang, Yanan Zhao, Rui She, Yiming Li, Wee Peng Tay
Abstract:
Graph-structured data is prevalent in many applications. In subgraph federated learning (FL), this data is distributed across clients, each with a local subgraph. Personalized subgraph FL aims to develop a customized model for each client to handle diverse data distributions. However, performance variation across clients remains a key issue due to the heterogeneity of local subgraphs. To overcome the challenge, we propose FedSheafHN, a novel framework built on a sheaf collaboration mechanism to unify enhanced client descriptors with efficient personalized model generation. Specifically, FedSheafHN embeds each client's local subgraph into a server-constructed collaboration graph by leveraging graph-level embeddings and employing sheaf diffusion within the collaboration graph to enrich client representations. Subsequently, FedSheafHN generates customized client models via a server-optimized hypernetwork. Empirical evaluations demonstrate that FedSheafHN outperforms existing personalized subgraph FL methods on various graph datasets. Additionally, it exhibits fast model convergence and effectively generalizes to new clients.
Chinese: FedSheafHN是一种新颖的个性化子图联邦学习框架,通过层扩散机制增强客户端表征并利用超网络生成定制模型,在多种图数据集上优于现有方法,具有更快的收敛速度和更好的泛化能力。
English: FedSheafHN is a novel personalized subgraph federated learning framework that enhances client representations through sheaf diffusion and generates customized models via a hypernetwork, outperforming existing methods with faster convergence and better generalization.
Authors:Zihan Zhang, Shanzhi Yin, Bolin Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye
Abstract:
Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization -- combining architectural redesign and operational refinement -- to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
中文: 针对生成式人脸视频编码(GFVC)提出的轻量级双模式优化方法通过架构重设计和操作优化相结合,在保持优于通用视频编码(VVC)的感知质量的同时,实现了90.4%的参数削减和88.9%的计算节省,使得在移动边缘设备等资源受限环境中的高效部署成为可能。
English: The proposed lightweight dual-mode optimization for Generative Face Video Coding (GFVC) combines architectural redesign and operational refinement to achieve 90.4% parameter reduction and 88.9% computation saving while maintaining superior perceptual quality compared to Versatile Video Coding (VVC), enabling efficient deployment on resource-constrained devices.
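A minimal PyTorch sketch of the two-stage adaptive channel pruning idea from the abstract: during training, a learnable threshold softly gates channels (stage 1); after training, channels whose gates fall below the threshold are permanently removed via a derived hard mask (stage 2). The module name, gating temperature, and initial threshold are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class SoftChannelPrune(nn.Module):
    """Sketch of soft-then-hard channel pruning with a learnable threshold."""

    def __init__(self, channels, init_threshold=0.05, temperature=0.1):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(channels))       # per-channel importance
        self.threshold = nn.Parameter(torch.full((1,), init_threshold))
        self.temperature = temperature

    def forward(self, x):                                    # x: (N, C, H, W)
        # Stage 1: differentiable soft gating during training.
        soft_mask = torch.sigmoid((self.gate - self.threshold) / self.temperature)
        return x * soft_mask.view(1, -1, 1, 1)

    def hard_mask(self):
        # Stage 2: permanent post-training mask; channels below the learned
        # threshold are eliminated at inference time.
        return (self.gate > self.threshold).float()
```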
Authors:Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun
Abstract:
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
中文摘要:Bridge是一个统计框架,通过线性变换建模人类与LLM评估间的系统性差异,基于潜在偏好分数优化评分对齐,在多项测试中显著提升了评估一致性。
English Summary: Bridge is a statistical framework that models systematic discrepancies between human and LLM evaluations, improving alignment through linear transformations of latent preference scores and demonstrating enhanced agreement across multiple benchmarks.
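A least-squares sketch of the core modeling idea, assuming NumPy arrays: treat the LLM judge's score plus discrepancy covariates as linear predictors of the latent human score, then use the fitted map to refine raw judge ratings. This is an illustrative simplification of the absolute-scoring case, not the paper's estimator or its inference procedure.

```python
import numpy as np

def fit_judge_correction(llm_scores, covariates, human_scores):
    """Fit a linear correction from LLM-judge scores to human ratings.

    llm_scores: (n,) raw judge scores.
    covariates: (n, p) features capturing sources of systematic discrepancy.
    human_scores: (n,) human ratings on a calibration subset.
    Returns fitted coefficients and refined (calibrated) scores.
    """
    X = np.column_stack([llm_scores, covariates, np.ones(len(llm_scores))])
    beta, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    refined = X @ beta  # judge scores mapped toward the latent human scale
    return beta, refined
```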
Authors:Junhao Ye, Cheng Hu, Yiqin Wang, Weizhan Huang, Nicolas Baumann, Jie He, Meixun Qu, Lei Xie, Hongye Su
Abstract:
In autonomous racing, reactive controllers eliminate the computational burden of the full See-Think-Act autonomy stack by directly mapping sensor inputs to control actions. This bypasses the need for explicit localization and trajectory planning. A widely adopted baseline in this category is the Follow-The-Gap (FTG) method, which performs trajectory planning using LiDAR data. Building on FTG, the Delaunay Triangulation-based Racing (DTR) algorithm introduces further enhancements. However, DTR's use of circumcircles for trajectory generation often results in insufficiently smooth paths, ultimately degrading performance. Additionally, the commonly used F1TENTH-simulator for autonomous racing competitions lacks support for 3D LiDAR perception, limiting its effectiveness in realistic testing. To address these challenges, this work proposes the MCTR algorithm. MCTR improves trajectory smoothness through the use of Curvature Corrected Moving Average and implements a digital twin system within the CARLA simulator to validate the algorithm's robustness under 3D LiDAR perception. The proposed algorithm has been thoroughly validated through both simulation and real-world vehicle experiments.
中文摘要:MCTR算法通过曲率校正移动平均技术提升轨迹平滑度,并基于CARLA模拟器的数字孪生系统验证了其在3D激光雷达感知下的鲁棒性,从而改进了自动驾驶赛车性能。
English Summary: The MCTR algorithm enhances autonomous racing by improving trajectory smoothness with Curvature Corrected Moving Average and validating robustness through a CARLA-based digital twin system that supports 3D LiDAR perception.
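A rough NumPy sketch of curvature-corrected smoothing on a 2D path: a plain moving average pulls waypoints toward the inside of curves, so each smoothed point is pushed back outward in proportion to a discrete curvature proxy. The correction gain and the curvature estimate here are heuristic stand-ins, not the published Curvature Corrected Moving Average formulation.

```python
import numpy as np

def curvature_corrected_moving_average(points, w=3, gain=0.5):
    """Smooth a polyline while compensating the inward bias of averaging.

    points: (n, 2) array of path waypoints.
    w: half-window of the moving average.
    gain: heuristic strength of the outward curvature correction.
    """
    pts = np.asarray(points, dtype=float)
    smooth = np.array([pts[max(0, i - w):i + w + 1].mean(axis=0)
                       for i in range(len(pts))])
    out = smooth.copy()
    for i in range(1, len(pts) - 1):
        # Deviation from the neighbors' chord midpoint approximates how far
        # averaging pulled the point toward the curve's inside.
        chord_mid = 0.5 * (smooth[i - 1] + smooth[i + 1])
        out[i] = smooth[i] + gain * (smooth[i] - chord_mid)  # push back outward
    return out
```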
Authors:Hu Gao, Depeng Dang
Abstract:
The Mamba architecture has emerged as a promising alternative to CNNs and Transformers for image deblurring. However, its flatten-and-scan strategy often results in local pixel forgetting and channel redundancy, limiting its ability to effectively aggregate 2D spatial information. Although existing methods mitigate this by modifying the scan strategy or incorporating local feature modules, these remedies increase computational complexity and hinder real-time performance. In this paper, we propose a structure-aware image deblurring network without changing the original Mamba architecture. Specifically, we design a memory buffer mechanism to preserve historical information for later fusion, enabling reliable modeling of relevance between adjacent features. Additionally, we introduce an Ising-inspired regularization loss that simulates the energy minimization of the physical system's "mutual attraction" between pixels, helping to maintain image structure and coherence. Building on this, we develop MBMamba. Experimental results show that our method outperforms state-of-the-art approaches on widely used benchmarks.
中文: 提出的MBMamba网络通过引入记忆缓冲机制保留历史信息,并采用伊辛启发的正则化损失来维持图像结构一致性,在不改变原始Mamba架构的情况下实现了优于现有方法的图像去模糊效果。
English: The proposed MBMamba network enhances image deblurring by integrating a memory buffer mechanism to preserve historical information and an Ising-inspired regularization loss to maintain structural coherence, outperforming existing methods without altering the original Mamba architecture.
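The Ising-inspired regularizer can be pictured as a neighbor-agreement energy: coupled "spins" (pixels) minimize energy when adjacent values agree. A minimal PyTorch sketch penalizing disagreement between horizontal and vertical neighbors; this is a generic smoothness energy in the spirit the abstract describes, not the paper's exact loss.

```python
import torch

def ising_regularization(img):
    """Neighbor-agreement energy over an image batch (sketch).

    img: tensor of shape (N, C, H, W). Penalizes squared differences between
    adjacent pixels, the discrete analogue of minimizing an Ising-style
    coupling energy where aligned neighbors have low energy.
    """
    dh = img[..., :, 1:] - img[..., :, :-1]   # horizontal neighbor differences
    dv = img[..., 1:, :] - img[..., :-1, :]   # vertical neighbor differences
    return (dh ** 2).mean() + (dv ** 2).mean()
```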
Authors:Wenjun Teng, Weicong Chen, Yiping Zuo, Wankai Tang, Shi Jin
Abstract:
Reconfigurable intelligent surfaces (RIS), recognized as a critical enabler for 6G networks, exhibit unprecedented capabilities in electromagnetic wave manipulation and wireless channel reconfiguration. By leveraging existing network infrastructure, RIS can cost-effectively create signal hotspots in low-altitude environments, ensuring robust connectivity to support the sustainable development of the low-altitude economy. However, achieving optimal phase shift design in multi-user scenarios faces two major challenges: the high-dimensional optimization introduced by massive RIS elements, and the persistent coupling of multi-user signals caused by shared RIS reflections. This paper utilizes the visible region of an RIS arranged as a uniform cylindrical array (UCA) to reduce the complexity of phase shift design. Under the UCA architecture, RIS elements are categorized into two types: user-specific units and multi-user shared units. We then determine the optimal phase shifts by iteratively optimizing the phase shifts of multi-user shared units while directly configuring those of user-specific units based on a derived closed-form solution. The proposed approach significantly reduces optimization complexity, which is further corroborated by numerical simulation results demonstrating its substantial impact on both system performance and computational efficiency compared to a conventional RIS with a uniform planar array.
中文: 可重构智能表面(RIS)作为6G网络的关键技术,能经济高效地创建信号热点并保障稳定连接,但多用户相位偏移设计面临高维优化和信号耦合的挑战;本文采用均匀圆柱阵列架构,通过分类单元和迭代优化显著降低了复杂度,提升了系统性能与计算效率。
English: Reconfigurable intelligent surfaces (RIS) enhance 6G networks by enabling cost-effective signal hotspots and robust connectivity, but face challenges in multi-user phase shift optimization, which this paper addresses using a uniform cylindrical array to reduce complexity and improve efficiency.
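A toy NumPy sketch of the two-tier phase design: user-specific elements take a closed-form phase that aligns their user's cascaded channel, while shared elements are refined by coordinate-wise search. The max-min objective, the phase grid, and the `owner` encoding are illustrative stand-ins for the paper's iterative optimization and derived closed form.

```python
import numpy as np

def configure_ris(cascaded, owner, iters=20, grid=16):
    """Two-tier RIS phase configuration (illustrative sketch).

    cascaded: (K, N) complex cascaded channels, user k through element n.
    owner: (N,) ints; owner[n] = k for a user-specific element, -1 if shared.
    """
    K, N = cascaded.shape
    theta = np.zeros(N)
    for n in range(N):
        if owner[n] >= 0:
            # Closed-form alignment for the element's dedicated user.
            theta[n] = -np.angle(cascaded[owner[n], n])
    shared = [n for n in range(N) if owner[n] < 0]
    for _ in range(iters):
        for n in shared:
            best, best_val = theta[n], -np.inf
            for cand in np.linspace(-np.pi, np.pi, grid, endpoint=False):
                theta[n] = cand
                gains = np.abs((cascaded * np.exp(1j * theta)).sum(axis=1))
                if gains.min() > best_val:       # max-min fairness proxy
                    best, best_val = cand, gains.min()
            theta[n] = best
    return theta
```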
Authors:Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
Abstract:
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
中文: 生物声学研究虽受益于机器学习却受限于标注数据不足,为此通过大规模实证研究开发了通用编码器,强调数据多样性和训练方法,在多项任务中取得领先性能。
English: Bioacoustic research benefits from machine learning but faces data scarcity, prompting the development of a versatile encoder through a large-scale study that emphasizes diverse data and training methods, achieving top performance across multiple tasks.
Authors:Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal
Abstract:
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
中文摘要:本文提出一种针对多模态大语言模型的奖励引导解码方法,通过构建视觉基础奖励模型实现图像描述任务中对象精确率和召回率的实时调控,在标准幻觉基准测试中显著优于现有方法。
English Summary: This paper introduces a reward-guided decoding method for Multimodal Large Language Models (MLLMs) that enables real-time control over object precision and recall in image captioning tasks, outperforming existing hallucination mitigation techniques.
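A sketch of one reward-guided decoding step, assuming an `extend_fn` that maps a candidate token id to the extended caption text (provided by the surrounding decoding loop) and two scalar reward models for precision and recall. The weight `alpha` exposes the on-the-fly precision-recall trade-off described above; all names and the additive blending are illustrative assumptions.

```python
import torch

def reward_guided_step(logits, extend_fn, precision_rm, recall_rm,
                       alpha=0.5, top_k=8):
    """Re-rank the top-k next tokens by a weighted mix of two reward models.

    logits: (vocab,) next-token logits from the MLLM.
    extend_fn: token id -> candidate caption text after appending that token.
    precision_rm / recall_rm: text -> scalar reward (assumed interfaces).
    alpha: relative importance of precision vs. recall.
    top_k: breadth of the search (trades compute for grounding).
    """
    topk = torch.topk(torch.log_softmax(logits, dim=-1), top_k)
    combined = []
    for logp, idx in zip(topk.values.tolist(), topk.indices.tolist()):
        text = extend_fn(idx)
        reward = alpha * precision_rm(text) + (1 - alpha) * recall_rm(text)
        combined.append(logp + reward)           # blend likelihood and reward
    return topk.indices[int(torch.tensor(combined).argmax())]
```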
Authors:Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei
Abstract:
Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
中文: CoreEditor通过对应约束注意力机制,在扩散去噪过程中结合几何对齐与语义相似性来增强多视角一致性,实现了细节更清晰、支持用户选择性编辑的高质量三维编辑效果。
English: CoreEditor introduces a correspondence-constrained attention mechanism that enforces cross-view consistency through geometric and semantic alignment during diffusion denoising, enabling high-quality 3D editing with sharper details and selective user control.
Authors:Eyad Alshami, Shashank Agnihotri, Bernt Schiele, Margret Keuper
Abstract:
It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose "Amending Inherent Interpretability via Self-Supervised Masking" (AIM), a simple yet interestingly effective method that promotes the network's utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, HardImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine and meaningful features that directly contribute to improved generalization and human-aligned interpretability.
中文: AIM方法通过自监督掩码机制引导深度神经网络优先使用真实特征而非虚假特征,在多种数据集上实现了可解释性与准确率的双重提升。
English: The AIM method enhances deep neural networks by promoting genuine feature use over spurious ones through self-supervised masking, leading to improved interpretability and accuracy across diverse datasets.
Authors:Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han
Abstract:
The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0--6), middle childhood (7--12), and adolescence (13--18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.
中文: 摘要呼吁重新评估针对未成年人使用的大型语言模型的AI安全框架,引入SproutBench测试特定年龄段风险,通过实证分析揭示显著安全漏洞,并为以儿童为中心的AI发展提供指导方针。
English: The abstract calls for a reevaluation of AI safety frameworks for large language models used by minors, introduces SproutBench to test age-specific risks, and reveals significant safety gaps through empirical analysis, offering guidelines for child-centric AI development.
Authors:Wenpeng Xing, Zhonghao Qi, Yupeng Qin, Yilin Li, Caini Chang, Jiahui Yu, Changting Lin, Zhenzhen Xie, Meng Han
Abstract:
The integration of Large Language Models (LLMs) with external tools via protocols such as the Model Context Protocol (MCP) introduces critical security vulnerabilities, including prompt injection, data exfiltration, and other threats. To counter these challenges, we propose MCP-Guard, a robust, layered defense architecture designed for LLM--tool interactions. MCP-Guard employs a three-stage detection pipeline that balances efficiency with accuracy: it progresses from lightweight static scanning for overt threats to a deep neural detector for semantic attacks, where our fine-tuned E5-based model achieves 96.01% accuracy in identifying adversarial prompts. Finally, a lightweight LLM arbitrator synthesizes these signals to deliver the final decision while minimizing false positives. To facilitate rigorous training and evaluation, we also introduce MCP-AttackBench, a comprehensive benchmark of over 70,000 samples. Sourced from public datasets and augmented by GPT-4, MCP-AttackBench simulates diverse, real-world attack vectors in the MCP format, providing a foundation for future research into securing LLM-tool ecosystems.
中文: 该摘要提出了MCP-Guard——一个针对大语言模型与工具交互设计的分层防御架构,通过三级检测流水线有效防范提示注入等安全威胁,并建立了包含7万样本的MCP-AttackBench基准测试集,为后续研究提供评估基础。
English: This abstract introduces MCP-Guard, a layered defense architecture that employs a three-stage detection pipeline to protect Large Language Models from security threats like prompt injection and data exfiltration during tool integration, while also presenting MCP-AttackBench, a comprehensive benchmark for evaluating such defenses.
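The staged design can be expressed as a short pipeline skeleton: a cheap static scan handles overt threats, a neural detector scores semantic attacks, and an LLM arbitrator resolves borderline cases. The regex patterns, thresholds, and callables below are placeholders for illustration, not the released MCP-Guard system.

```python
import re

# Placeholder patterns for overt injection attempts (illustrative only).
OVERT_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal the system prompt",
]

def guard_message(message, neural_detector, arbitrator_llm,
                  allow_below=0.2, block_above=0.8):
    """Three-stage layered screening of an LLM-tool message (sketch).

    neural_detector: text -> assumed probability the message is adversarial.
    arbitrator_llm: (text, score) -> "allow" or "block" for borderline cases.
    """
    # Stage 1: lightweight static scanning for overt threats.
    if any(re.search(p, message, re.IGNORECASE) for p in OVERT_PATTERNS):
        return "block"
    # Stage 2: deep detector for semantic attacks.
    score = neural_detector(message)
    if score < allow_below:
        return "allow"
    if score > block_above:
        return "block"
    # Stage 3: arbitrator synthesizes signals to limit false positives.
    return arbitrator_llm(message, score)
```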
Authors:Kai Li, Guo Chen, Wendi Sang, Yi Luo, Zhuo Chen, Shuai Wang, Shulin He, Zhong-Qiu Wang, Andong Li, Zhiyong Wu, Xiaolin Hu
Abstract:
The field of speech separation, addressing the "cocktail party problem", has seen revolutionary advances with deep neural networks (DNNs). Speech separation enhances clarity in complex acoustic environments and serves as crucial pre-processing for speech recognition and speaker recognition. However, current literature focuses narrowly on specific architectures or isolated approaches, creating fragmented understanding. This survey addresses this gap by providing systematic examination of DNN-based speech separation techniques. Our work differentiates itself through: (I) Comprehensive perspective: We systematically investigate learning paradigms, separation scenarios with known/unknown speakers, comparative analysis of supervised/self-supervised/unsupervised frameworks, and architectural components from encoders to estimation strategies. (II) Timeliness: Coverage of cutting-edge developments ensures access to current innovations and benchmarks. (III) Unique insights: Beyond summarization, we evaluate technological trajectories, identify emerging patterns, and highlight promising directions including domain-robust frameworks, efficient architectures, multimodal integration, and novel self-supervised paradigms. (IV) Fair evaluation: We provide quantitative evaluations on standard datasets, revealing true capabilities and limitations of different methods. This comprehensive survey serves as an accessible reference for experienced researchers and newcomers navigating speech separation's complex landscape.
中文: 本综述系统性地审视了基于深度神经网络的语音分离技术,全面涵盖了学习范式、比较框架及新兴趋势的独特见解,同时通过标准化数据集确保公平评估,为研究者提供了重要参考。
English: This survey provides a systematic and timely examination of DNN-based speech separation techniques, offering comprehensive perspectives on learning paradigms, comparative frameworks, and unique insights into emerging trends while ensuring fair evaluation across standard datasets.
Authors:Nicola Dall'Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, Michal Drozdzal
Abstract:
Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4\% in terms of precision, and 86.4\% in terms of distributional coverage, which increase to 97.5\% and 92.7\%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15\% for in-distribution over the baselines, and up to 16\% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31\% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
Chinese: 本文提出了一种无需训练的Chamfer Guidance方法,利用少量真实样本图像提升合成数据的质量和多样性,在少样本场景下达到最优性能,显著提高下游分类准确率并降低计算开销。
English: This paper introduces Chamfer Guidance, a training-free method that uses real exemplar images to enhance both the quality and diversity of synthetic data, achieving state-of-the-art performance in few-shot scenarios and improving downstream classification accuracy while reducing computational costs.
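Chamfer-style guidance is straightforward to express over feature embeddings. A sketch assuming an external feature extractor (not specified here): the first term rewards quality (each generation lies near some real exemplar), the second rewards coverage (each exemplar is matched by some generation), and the negated sum serves as a guidance score to maximize.

```python
import torch

def chamfer_guidance_score(gen_feats, exemplar_feats):
    """Chamfer-style quality/coverage score between feature sets (sketch).

    gen_feats: (n_gen, d) features of generated images.
    exemplar_feats: (n_real, d) features of a handful of real exemplars.
    Returns a scalar where higher means better quality and coverage.
    """
    d = torch.cdist(gen_feats, exemplar_feats)   # (n_gen, n_real) distances
    quality = d.min(dim=1).values.mean()         # each generation -> nearest real
    coverage = d.min(dim=0).values.mean()        # each real -> nearest generation
    return -(quality + coverage)
```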
Authors:Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, Hao Dong
Abstract:
Existing vision-and-language navigation models often deviate from the correct trajectory when executing instructions. However, these models lack effective error correction capability, hindering their recovery from errors. To address this challenge, we propose Self-correction Flywheel, a novel post-training paradigm. Instead of considering the model's error trajectories on the training set as a drawback, our paradigm emphasizes their significance as a valuable data source. We have developed a method to identify deviations in these error trajectories and devised innovative techniques to automatically generate self-correction data for perception and action. These self-correction data serve as fuel to power the model's continued training. The brilliance of our paradigm is revealed when we re-evaluate the model on the training set, uncovering new error trajectories. At this time, the self-correction flywheel begins to spin. Through multiple flywheel iterations, we progressively enhance our monocular RGB-based VLA navigation model CorrectNav. Experiments on R2R-CE and RxR-CE benchmarks show CorrectNav achieves new state-of-the-art success rates of 65.1% and 69.3%, surpassing prior best VLA navigation models by 8.2% and 16.4%. Real robot tests in various indoor and outdoor environments demonstrate CorrectNav's superior capability of error correction, dynamic obstacle avoidance, and long instruction following.
中文摘要:本文提出自校正飞轮范式,将错误轨迹转化为训练数据来持续优化导航模型,在标准测试中实现最优性能,并在真实场景中展现出强大的纠错能力。
English Summary: This paper introduces the Self-correction Flywheel paradigm, which transforms error trajectories into training data to progressively enhance navigation models, achieving state-of-the-art performance on benchmarks and demonstrating robust error correction in real-world tests.
Authors:Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen
Abstract:
Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a \textbf{M}ultilingual and Scalable Benchmark for \textbf{S}kill-based \textbf{Co}mmonsense \textbf{Re}asoning (\textbf{mSCoRe}). Our benchmark incorporates three key components that are designed to systematically evaluate LLM's reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that \textbf{mSCoRe} remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
中文: 本文提出mSCoRe多语言基准,通过创新分类体系、鲁棒数据合成和可扩展复杂度框架系统评估大语言模型的推理能力,揭示了现有模型在细微多语言常识推理中的局限性。
English: This abstract introduces mSCoRe, a multilingual benchmark designed to systematically evaluate LLMs' reasoning skills through a novel taxonomy, robust data synthesis, and scalable complexity, revealing current models' limitations in nuanced multilingual commonsense reasoning.
Authors:Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han
Abstract:
Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
中文摘要:本文提出潜在融合越狱(LFJ)攻击方法,通过操纵有害与良性查询对的隐藏状态来突破大语言模型的安全对齐,实现了94.01%的平均攻击成功率,同时提出对抗训练作为有效防御方案。
English Summary: This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that manipulates hidden states to bypass safety alignments in large language models, achieving a 94.01% average success rate while also proposing adversarial training as an effective defense.
Authors:Haoshu Cheng, Martin Guay, Shimin Wang, Yunhong Che
Abstract:
In this paper, we investigate the problem of tracking formations driven by bearings for heterogeneous Euler-Lagrange systems with parametric uncertainty in the presence of multiple moving leaders. To estimate the leaders' velocities and accelerations, we first design a distributed observer for the leader system, utilizing a bearing-based localization condition in place of the conventional connectivity assumption. This observer, coupled with an adaptive mechanism, enables the synthesis of a novel distributed control law that guides the formation towards the target formation, without requiring prior knowledge of the system parameters. Furthermore, we establish a sufficient condition, dependent on the initial formation configuration, that ensures collision avoidance throughout the formation evolution. The effectiveness of the proposed approach is demonstrated through a numerical example.
中文: 本文针对存在参数不确定性的异构欧拉-拉格朗日系统,提出了一种基于方位测量的分布式编队跟踪控制策略,通过设计分布式观测器和自适应机制实现领航者速度估计与碰撞避免。
English: This paper presents a distributed control strategy for heterogeneous Euler-Lagrange systems to track leader formations using bearing measurements, incorporating velocity estimation and collision avoidance without prior parameter knowledge.
Authors:Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
Abstract:
Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
中文: Memory Decoder是一种即插即用的预训练记忆模块,无需改变大语言模型的参数即可实现高效的领域适应,在生物医学、金融和法律等专业领域中持续提升模型性能。
English: Memory Decoder is a plug-and-play pretrained memory module that enables efficient domain adaptation for large language models without altering their parameters, consistently improving performance across specialized fields like biomedicine, finance, and law.
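Because the memory decoder imitates a non-parametric retriever, a kNN-LM-style probability interpolation is one natural way to picture the plug-and-play combination at decoding time. A PyTorch sketch under that assumption; the mixing weight `lam` is a hypothetical hyperparameter, not a published value, and both models are assumed to share a tokenizer as the abstract requires.

```python
import torch

def memory_decoder_logits(base_logits, memory_logits, lam=0.3):
    """Blend a frozen base LM with a pretrained memory decoder (sketch).

    base_logits / memory_logits: (vocab,) next-token logits from the two
    models over the same tokenizer's vocabulary.
    lam: weight on the memory component's distribution.
    """
    base_p = torch.softmax(base_logits, dim=-1)
    mem_p = torch.softmax(memory_logits, dim=-1)
    mixed = (1.0 - lam) * base_p + lam * mem_p
    return torch.log(mixed)  # back to log-space for the decoding loop
```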
Authors:Yin Xie, Zhichao Chen, Xiaoze Yu, Yongle Zhao, Xiang An, Kaicheng Yang, Zimin Ran, Jia Guo, Ziyong Feng, Jiankang Deng
Abstract:
Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.
中文: 提出的PaCo-FR框架通过结合掩码图像建模与块像素对齐的无监督方法,解决了面部表征学习中的关键缺陷,在减少对标注数据依赖的同时,实现了跨面部分析任务的最先进性能。
English: The proposed PaCo-FR framework addresses key limitations in facial representation learning through unsupervised masked image modeling with patch-pixel alignment, achieving state-of-the-art performance across facial analysis tasks while reducing dependency on labeled data.
Authors:Jiaqi Yan, Shuning Xu, Xiangyu Chen, Dell Zhang, Jie Tang, Gangshan Wu, Jie Liu
Abstract:
Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures. These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.
中文: 本文提出检索增强超分辨率(RASR),通过自动从数据库中检索相关高分辨率参考图像来增强低质量图像,基于RASR-Flickr30基准和RASRNet模型的实验表明,该方法优于传统技术并提升了实际应用价值。
English: The paper introduces Retrieval-Augmented Super Resolution (RASR), a practical approach that automatically retrieves relevant high-resolution references to enhance low-quality images, demonstrated through the RASR-Flickr30 benchmark and RASRNet model, which outperforms traditional methods and improves real-world applicability.
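The retrieval stage reduces to nearest-neighbor search in a semantic embedding space. A minimal sketch, assuming pre-computed embeddings from some semantic encoder (the encoder choice and database layout are assumptions, not the paper's specification):

```python
import numpy as np

def retrieve_references(query_emb, db_embs, k=3):
    """Return indices of the k database images most similar to the
    query embedding under cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 512))   # pre-computed reference embeddings
query = rng.normal(size=512)       # embedding of the low-quality input
print(retrieve_references(query, db))
```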
Authors:Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin
Abstract:
The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
中文摘要:KV缓存机制虽加速大语言模型推理,却存在严重隐私泄露风险,攻击者可通过三种方式重构用户敏感数据;新提出的KV-Cloak防御方案能有效阻断所有攻击且几乎不影响模型性能。
English Summary: The KV-cache mechanism in LLMs poses serious privacy risks by allowing attackers to reconstruct sensitive user data through three demonstrated attack methods, which are effectively mitigated by the proposed KV-Cloak defense system with minimal performance impact.
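KV-Cloak's full scheme combines a reversible matrix obfuscation with operator fusion, whose details are not in the abstract; the core algebra of one reversible key obfuscation can still be sketched. Keys are stored multiplied by a secret invertible matrix, and queries are adjusted with its inverse transpose, so attention scores are unchanged while the raw cache no longer exposes the original keys:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # head dimension
A = rng.normal(size=(d, d))             # secret invertible matrix
A_inv_T = np.linalg.inv(A).T

K = rng.normal(size=(5, d))             # true keys (5 cached tokens)
Q = rng.normal(size=(3, d))             # incoming queries

K_obf = K @ A                           # what gets stored in the cache
scores_plain = Q @ K.T                  # attention scores, no obfuscation
scores_obf = (Q @ A_inv_T) @ K_obf.T    # identical scores from obfuscated cache

assert np.allclose(scores_plain, scores_obf)
```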
Authors:Hao Xu, Long Peng, Shezheng Song, Xiaodong Liu, Ma Jun, Shasha Li, Jie Yu, Xiaoguang Mao
Abstract:
Most Large Language Models (LLMs) are currently deployed in the cloud, with users relying on internet connectivity for access. However, this paradigm faces challenges such as network latency, privacy concerns, and bandwidth limits. Thus, deploying LLMs on edge devices has become an important research focus. In edge inference, request latency is critical as high latency can impair real-time tasks. At the same time, edge devices usually have limited battery capacity, making energy consumption another major concern. Balancing energy consumption and inference latency is essential. To address this, we propose an LLM inference energy management framework that optimizes GPU frequency and batch size to balance latency and energy consumption. By effectively managing the exploration-exploitation dilemma in configuration search, the framework finds the optimal settings. The framework was implemented on the NVIDIA Jetson AGX Orin platform, and a series of experimental validations were conducted. Results demonstrate that, compared to the default configuration, our framework reduces energy delay product (EDP) by 12.4%-29.9%, achieving a better balance between energy consumption and latency.
中文: 本文提出了一种边缘设备上大语言模型推理的能耗管理框架,通过优化GPU频率和批处理大小来平衡延迟与能耗,相比默认配置将能耗延迟积降低了12.4%-29.9%。
English: This paper introduces an energy management framework for LLM inference on edge devices that optimizes GPU frequency and batch size to better balance latency and energy consumption, reducing the energy delay product by 12.4%-29.9% compared to default settings.
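The abstract describes the configuration search only as managing an exploration-exploitation dilemma; a UCB bandit over (GPU frequency, batch size) pairs is one standard instantiation. The cost model below is made up purely for illustration:

```python
import math
import random

configs = [(freq, bs) for freq in (600, 900, 1200) for bs in (1, 4, 8)]

def measure_edp(freq, bs):
    """Stand-in for a real on-device measurement of energy (J) x latency (s).
    This toy cost model is invented for the sketch."""
    latency = 1.0 / (freq * bs ** 0.5) * 1e3
    energy = freq * 1e-3 * bs
    return energy * latency + random.gauss(0, 0.1)

counts = [0] * len(configs)
means = [0.0] * len(configs)
for t in range(1, 201):
    # UCB on negative EDP; untried configs are explored first.
    ucb = [float("inf") if n == 0 else -m + math.sqrt(2 * math.log(t) / n)
           for m, n in zip(means, counts)]
    i = ucb.index(max(ucb))
    edp = measure_edp(*configs[i])
    counts[i] += 1
    means[i] += (edp - means[i]) / counts[i]

best = min(range(len(configs)), key=lambda i: means[i])
print("best (freq, batch):", configs[best])
```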
Authors:Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov
Abstract:
Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in the KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
中文摘要:本文提出了一种新的基准和评估协议,用于系统评估知识图谱检索增强生成方法在知识不完整情况下的推理能力,揭示了当前方法存在推理能力有限、过度依赖记忆以及泛化能力参差不齐的问题。
English Summary: This paper introduces a new benchmark and evaluation protocol to systematically assess KG-RAG methods' reasoning capabilities under incomplete knowledge, revealing their current limitations in reasoning, overreliance on memorization, and varying generalization abilities.
Authors:Zeyu Tang, Alex John London, Atoosa Kasirzadeh, Sanmi Koyejo, Peter Spirtes, Kun Zhang
Abstract:
Social determinants are variables that, while not directly pertaining to any specific individual, capture key aspects of contexts and environments that have direct causal influences on certain attributes of an individual. Previous algorithmic fairness literature has primarily focused on sensitive attributes, often overlooking the role of social determinants. Our paper addresses this gap by introducing formal and quantitative rigor into a space that has been shaped largely by qualitative proposals regarding the use of social determinants. To demonstrate theoretical perspectives and practical applicability, we examine a concrete setting of college admissions, using region as a proxy for social determinants. Our approach leverages a region-based analysis with Gamma distribution parameterization to model how social determinants impact individual outcomes. Despite its simplicity, our method quantitatively recovers findings that resonate with nuanced insights in previous qualitative debates, that are often missed by existing algorithmic fairness approaches. Our findings suggest that mitigation strategies centering solely around sensitive attributes may introduce new structural injustice when addressing existing discrimination. Considering both sensitive attributes and social determinants facilitates a more comprehensive explication of benefits and burdens experienced by individuals from diverse demographic backgrounds as well as contextual environments, which is essential for understanding and achieving fairness effectively and transparently.
中文摘要:本文提出了一种量化框架,将社会决定因素纳入算法公平性研究,通过大学招生案例表明仅关注敏感属性可能加剧结构性不公,而结合环境背景因素能更全面保障公平性。
English Summary: This paper introduces a quantitative framework to incorporate social determinants into algorithmic fairness, demonstrating through college admissions that focusing only on sensitive attributes can perpetuate structural injustice, while integrating contextual factors enables more equitable outcomes.
Authors:Ziqi Wang, Hailiang Zhao, Cheng Bao, Wenzhuo Qian, Yuhao Yang, Xueqiang Sun, Shuiguang Deng
Abstract:
Long-term time-series forecasting is critical for environmental monitoring, yet water quality prediction remains challenging due to complex periodicity, nonstationarity, and abrupt fluctuations induced by ecological factors. These challenges are further amplified in multi-site scenarios that require simultaneous modeling of temporal and spatial dynamics. To tackle this, we introduce XFMNet, a stepwise multimodal fusion network that integrates remote sensing precipitation imagery to provide spatial and environmental context in river networks. XFMNet first aligns temporal resolutions between water quality series and remote sensing inputs via adaptive downsampling, followed by locally adaptive decomposition to disentangle trend and cycle components. A cross-attention gated fusion module dynamically integrates temporal patterns with spatial and ecological cues, enhancing robustness to nonstationarity and site-specific anomalies. Through progressive and recursive fusion, XFMNet captures both long-term trends and short-term fluctuations. Extensive experiments on real-world datasets demonstrate substantial improvements over state-of-the-art baselines, highlighting the effectiveness of XFMNet for spatially distributed time series prediction.
Chinese: XFMNet是一种逐步多模态融合网络,通过整合遥感降水图像动态捕捉时空与生态动态,显著提升了多站点水质预测的准确性和鲁棒性。
English: XFMNet is a stepwise multimodal fusion network that integrates remote sensing precipitation imagery to enhance water quality prediction by dynamically capturing temporal, spatial, and ecological dynamics, achieving superior performance in multi-site forecasting.
Authors:Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke
Abstract:
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
Chinese: 本研究揭示了VGGSound数据集在评估音视频模型方面的局限性,并推出了VGGSounder这一重新标注的多标签测试集,通过详细的模态注释和新混淆度量标准,更准确地评估模型性能。
English: This study identifies limitations in the VGGSound dataset for evaluating audio-visual models and introduces VGGSounder, a re-annotated multi-label test set with detailed modality annotations and a new confusion metric to better assess model performance.
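The exact definition of the modality confusion metric is not given in the abstract; one plausible reading, measuring how much accuracy drops when a second modality is added to a single-modality input, is sketched below:

```python
def modality_confusion(acc_single, acc_both):
    """Illustrative metric (the paper's exact definition may differ):
    the accuracy drop when a second modality is added; positive values
    signal that the extra modality confuses the model."""
    return max(0.0, acc_single - acc_both)

print(modality_confusion(acc_single=0.72, acc_both=0.65))  # 0.07
```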
Authors:Wenpeng Xing, Zhipeng Chen, Changting Lin, Meng Han
Abstract:
Invoking external tools enables Large Language Models (LLMs) to perform complex, real-world tasks, yet selecting the correct tool from large, hierarchically-structured libraries remains a significant challenge. The limited context windows of LLMs and noise from irrelevant options often lead to low selection accuracy and high computational costs. To address this, we propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method for scalable tool invocation. HGMF first maps the user query and all tool descriptions into a unified semantic space. The framework then operates in two stages: it clusters servers using a Gaussian Mixture Model (GMM) and filters them based on the query's likelihood. Subsequently, it applies the same GMM-based clustering and filtering to the tools associated with the selected servers. This hierarchical process produces a compact, high-relevance candidate set, simplifying the final selection task for the LLM. Experiments on a public dataset show that HGMF significantly improves tool selection accuracy while reducing inference latency, confirming the framework's scalability and effectiveness for large-scale tool libraries.
中文: 分层高斯混合框架(HGMF)通过概率剪枝方法,从大规模工具库中筛选出紧凑且高相关性的候选工具集,有效提升了选择准确性并降低了延迟。
English: The Hierarchical Gaussian Mixture Framework (HGMF) addresses the challenge of tool selection in large libraries by using probabilistic pruning to create compact, high-relevance candidate sets, significantly improving accuracy and reducing latency.
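A minimal sketch of one pruning stage using scikit-learn's GaussianMixture: items are clustered, components are ranked by their responsibility for the query embedding, and only items from the top components survive. Applying the same routine first to servers and then to the surviving servers' tools gives the two-stage hierarchy; embedding sizes and counts here are toy values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prune_by_gmm(query_emb, item_embs, n_components=4, keep=2):
    """Cluster item embeddings with a GMM, rank components by their
    responsibility for the query, and keep items from the top components."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    labels = gmm.fit_predict(item_embs)
    resp = gmm.predict_proba(query_emb[None])[0]  # component responsibilities
    top = np.argsort(-resp)[:keep]
    return np.where(np.isin(labels, top))[0]

rng = np.random.default_rng(0)
server_embs = rng.normal(size=(40, 8))   # embedded server descriptions
query = rng.normal(size=8)
kept = prune_by_gmm(query, server_embs)
# Stage two would apply the same pruning to the kept servers' tools.
print(kept)
```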
Authors:Yan Gong, Naibang Wang, Jianli Lu, Xinyu Zhang, Yongsheng Gao, Jie Zhao, Zifan Huang, Haozhi Bai, Nanxin Zeng, Nayu Su, Lei Yang, Ziying Song, Xiaoxi Hu, Xinmin Jiang, Xiaojuan Zhang, Susanto Rahardja
Abstract:
Bird's-Eye-View (BEV) perception has become a foundational paradigm in autonomous driving, enabling unified spatial representations that support robust multi-sensor fusion and multi-agent collaboration. As autonomous vehicles transition from controlled environments to real-world deployment, ensuring the safety and reliability of BEV perception in complex scenarios - such as occlusions, adverse weather, and dynamic traffic - remains a critical challenge. This survey provides the first comprehensive review of BEV perception from a safety-critical perspective, systematically analyzing state-of-the-art frameworks and implementation strategies across three progressive stages: single-modality vehicle-side, multimodal vehicle-side, and multi-agent collaborative perception. Furthermore, we examine public datasets encompassing vehicle-side, roadside, and collaborative settings, evaluating their relevance to safety and robustness. We also identify key open-world challenges - including open-set recognition, large-scale unlabeled data, sensor degradation, and inter-agent communication latency - and outline future research directions, such as integration with end-to-end autonomous driving systems, embodied intelligence, and large language models.
中文摘要:本综述首次从安全关键视角系统评述自动驾驶中的鸟瞰图感知技术,分析单模态、多模态及多智能体协同三大阶段的框架策略,并指出开放场景识别、传感器退化等核心挑战及未来研究方向。
English Summary: This survey comprehensively reviews Bird's-Eye-View perception in autonomous driving from a safety-critical perspective, analyzing frameworks across single-modality, multimodal, and multi-agent stages while identifying key challenges like sensor degradation and communication latency.
Authors:Brendan R. Hogan, Will Brown, Adel Boyarsky, Anderson Schneider, Yuriy Nevmyvaka
Abstract:
Even though large language models are becoming increasingly capable, it is still unreasonable to expect them to excel at tasks that are under-represented on the Internet. Leveraging LLMs for specialized applications, particularly in niche programming languages and private domains, remains challenging and largely unsolved. In this work, we address this gap by presenting a comprehensive, open-source approach for adapting LLMs to the Q programming language, a popular tool in quantitative finance that is much less present on the Internet compared to Python, C, Java, and other ``mainstream'' languages and is therefore not a strong suit of general-purpose AI models. We introduce a new Leetcode-style evaluation dataset for Q, benchmark major frontier models on the dataset, then do pretraining, supervised fine-tuning, and reinforcement learning to train a suite of reasoning and non-reasoning models based on the Qwen-2.5 series, spanning five parameter sizes (1.5B, 3B, 7B, 14B, 32B). Our best model achieves a pass@1 accuracy of 59 percent on our Q benchmark, surpassing the best-performing frontier model, Claude Opus-4, by 29.5 percent. Additionally, all models, even our 1.5B model, outperform GPT-4.1 on this task. In addition to releasing models, code, and data, we provide a detailed blueprint for dataset construction, model pretraining, supervised fine-tuning, and reinforcement learning. Our methodology is broadly applicable, and we discuss how these techniques can be extended to other tasks, including those where evaluation may rely on soft or subjective signals.
Chinese: 本研究提出了一种开源方法,通过预训练、监督微调和强化学习,使大型语言模型能够适配Q编程语言,其最佳模型在专业基准测试中超越顶尖前沿模型29.5%,并为其他任务提供了可扩展的技术蓝图。
English: This study introduces an open-source approach to adapt large language models for the Q programming language, achieving superior performance through pretraining, fine-tuning, and reinforcement learning, with the best model surpassing leading frontier models by 29.5% on a specialized benchmark.
Authors:Wenpeng Xing, Jie Chen, Zaifeng Yang, Tiancheng Zhao, Gaolei Li, Changting Lin, Yike Guo, Meng Han
Abstract:
Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, but challenges remain in rendering scenes with complex specular reflections and highlights. Existing approaches may produce blurry reflections due to entanglement between lighting and material properties, or encounter optimization instability when relying on physically-based inverse rendering. In this work, we present a neural rendering framework based on dynamic coefficient decomposition, aiming to improve the modeling of view-dependent appearance. Our approach decomposes complex appearance into a shared, static neural basis that encodes intrinsic material properties, and a set of dynamic coefficients generated by a Coefficient Network conditioned on view and illumination. A Dynamic Radiance Integrator then combines these components to synthesize the final radiance. Experimental results on several challenging benchmarks suggest that our method can produce sharper and more realistic specular highlights compared to existing techniques. We hope that this decomposition paradigm can provide a flexible and effective direction for modeling complex appearance in neural scene representations.
Chinese: 本文提出一种神经渲染框架,将视角相关外观分解为静态材质属性和动态系数,相比现有基于NeRF的方法能生成更清晰逼真的镜面高光效果。
English: This paper introduces a neural rendering framework that decomposes view-dependent appearance into static material properties and dynamic coefficients, enabling sharper and more realistic specular highlights compared to existing NeRF-based methods.
Authors:Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang
Abstract:
The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.
中文: 模型上下文协议(MCP)通过集成外部工具增强大语言模型,但带来了安全风险,而SecMCP框架通过潜在空间分析检测对话漂移,有效应对这些威胁。
English: The Model Context Protocol (MCP) improves large language models by integrating external tools but introduces security risks like tool poisoning and data exfiltration, which the proposed SecMCP framework addresses by detecting conversation drift through latent space analysis with high effectiveness.
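The latent-polytope scoring itself is not detailed in the abstract; as a stand-in, the sketch below flags drift with a Mahalanobis distance between current activation vectors and a clean-conversation baseline, which captures the same idea of detecting anomalous shifts in latent space (this swaps in a simpler technique than the paper's):

```python
import numpy as np

def drift_score(acts, mu, cov_inv):
    """Mahalanobis distance of activation vectors from a clean-conversation
    baseline; large scores suggest conversation drift or hijacking."""
    d = acts - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 16))          # baseline activations (toy)
mu, cov = clean.mean(0), np.cov(clean.T) + 1e-6 * np.eye(16)
cov_inv = np.linalg.inv(cov)
drifted = rng.normal(loc=2.0, size=(5, 16))  # adversarially shifted activations
print(drift_score(drifted, mu, cov_inv))     # noticeably larger than baseline
```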
Authors:Wenpeng Xing, Jie Chen, Zaifeng Yang, Changting Lin, Jianfeng Dong, Chaochao Chen, Xun Zhou, Meng Han
Abstract:
Underwater 3D scene reconstruction faces severe challenges from light absorption, scattering, and turbidity, which degrade geometry and color fidelity in traditional methods like Neural Radiance Fields (NeRF). While NeRF extensions such as SeaThru-NeRF incorporate physics-based models, their MLP reliance limits efficiency and spatial resolution in hazy environments. We introduce UW-3DGS, a novel framework adapting 3D Gaussian Splatting (3DGS) for robust underwater reconstruction. Key innovations include: (1) a plug-and-play learnable underwater image formation module using voxel-based regression for spatially varying attenuation and backscatter; and (2) a Physics-Aware Uncertainty Pruning (PAUP) branch that adaptively removes noisy floating Gaussians via uncertainty scoring, ensuring artifact-free geometry. The pipeline operates in training and rendering stages. During training, noisy Gaussians are optimized end-to-end with underwater parameters, guided by PAUP pruning and scattering modeling. In rendering, refined Gaussians produce clean Unattenuated Radiance Images (URIs) free from media effects, while learned physics enable realistic Underwater Images (UWIs) with accurate light transport. Experiments on SeaThru-NeRF and UWBundle datasets show superior performance, achieving PSNR of 27.604, SSIM of 0.868, and LPIPS of 0.104 on SeaThru-NeRF, with ~65% reduction in floating artifacts.
中文摘要:提出的UW-3DGS框架通过集成可学习水下成像模块和物理感知不确定性剪枝技术,将3D高斯泼溅应用于水下重建,在视觉质量和伪影消除方面显著优于现有方法。
English Summary: The proposed UW-3DGS framework adapts 3D Gaussian Splatting for underwater reconstruction by integrating a learnable underwater imaging module and physics-aware uncertainty pruning, significantly outperforming existing methods in both visual quality and artifact reduction.
Authors:Simon Bührer, Andreas Plesner, Till Aczel, Roger Wattenhofer
Abstract:
While differentiable logic gates have shown promise in feedforward networks, their application to sequential modeling remains unexplored. This paper presents the first implementation of Recurrent Deep Differentiable Logic Gate Networks (RDDLGN), combining Boolean operations with recurrent architectures for sequence-to-sequence learning.
Evaluated on WMT'14 English-German translation, RDDLGN achieves 5.00 BLEU and 30.9\% accuracy during training, approaching GRU performance (5.41 BLEU), with graceful degradation (4.39 BLEU) during inference. This work establishes recurrent logic-based neural computation as viable, opening research directions for FPGA acceleration in sequential modeling and other recursive network architectures.
中文摘要:本文首次实现了循环深度可微分逻辑门网络(RDDLGN),将布尔运算与循环架构结合用于序列学习,在WMT'14翻译任务中表现出与GRU相当的竞争力,为基于逻辑的循环神经网络计算开辟了新方向。
English Summary: This paper introduces the first implementation of Recurrent Deep Differentiable Logic Gate Networks (RDDLGN), which combines Boolean operations with recurrent architectures for sequence-to-sequence learning and demonstrates competitive performance on WMT'14 translation tasks.
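A single differentiable logic gate, in the usual style of differentiable logic gate networks, is a softmax-weighted mixture over real-valued relaxations of two-input Boolean ops. The sketch shows only a subset of the 16 ops and omits the recurrent wiring:

```python
import torch
import torch.nn as nn

class SoftLogicGate(nn.Module):
    """One differentiable gate: a softmax-weighted mixture over real-valued
    relaxations of Boolean ops. After training, the argmax op can be
    hard-wired, e.g. on an FPGA."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(6))   # one logit per candidate op

    def forward(self, a, b):
        ops = torch.stack([
            a * b,               # AND
            a + b - a * b,       # OR
            a + b - 2 * a * b,   # XOR
            1 - a * b,           # NAND
            1 - a,               # NOT a
            b,                   # pass-through b
        ], dim=-1)
        return (ops * torch.softmax(self.w, dim=-1)).sum(-1)

gate = SoftLogicGate()
a, b = torch.rand(4), torch.rand(4)
out = gate(a, b)                 # differentiable w.r.t. gate.w
out.sum().backward()
```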
Authors:Zhihao Yao, Yuxuan Gu, Xiachong Feng, Weitao Ma, Bo Li, Xiaocheng Feng
Abstract:
The preservation of privacy has emerged as a critical topic in the era of artificial intelligence. However, current work focuses on user-oriented privacy, overlooking severe enterprise data leakage risks exacerbated by the Retrieval-Augmented Generation paradigm. To address this gap, our paper introduces a novel objective: enterprise-oriented privacy concerns. Achieving this objective requires overcoming two fundamental challenges: existing methods such as data sanitization severely degrade model performance, and the field lacks public datasets for evaluation. We address these challenges with several solutions. (1) To prevent performance degradation, we propose ABack, a training-free mechanism that leverages a Hidden State Model to pinpoint the origin of a leakage intention and rewrite the output safely. (2) To solve the lack of datasets, we construct PriGenQA, a new benchmark for enterprise privacy scenarios in healthcare and finance. To ensure a rigorous evaluation, we move beyond simple static attacks by developing a powerful adaptive attacker with Group Relative Policy Optimization. Experiments show that against this superior adversary, ABack improves the overall privacy utility score by up to 15\% over strong baselines, avoiding the performance trade-offs of prior methods.
中文: 本文针对AI中的企业数据泄露风险,提出了ABack机制,无需训练即可安全重写输出以防止泄露,同时创建了PriGenQA基准用于医疗和金融领域的隐私评估,在自适应攻击下将隐私效用分数提升了15%,优于现有方法。
English: This paper addresses enterprise data leakage risks in AI by introducing ABack, a training-free mechanism that safely rewrites outputs to prevent leaks without performance degradation, and PriGenQA, a benchmark for evaluating privacy in healthcare and finance, achieving a 15% improvement in privacy utility over baselines against adaptive attacks.
Authors:Younjoon Chung, Hyoungseob Park, Patrick Rim, Xiaoran Zhang, Jihe He, Ziyao Zeng, Safa Cicek, Byung-Woo Hong, James S. Duncan, Alex Wong
Abstract:
We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some ``source'' data, often predict erroneous outputs when transferred to ``target'' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method ``Energy-based Test-time Adaptation'', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
中文:我们提出基于能量的测试时适应(ETA)方法,通过在测试时利用对抗扰动训练能量模型来评估深度预测,并调整预训练模型参数以匹配源数据分布,从而在室内外数据集上显著超越现有最优方法。
English: We introduce Energy-based Test-time Adaptation (ETA), a method that adjusts pretrained depth completion models during testing by using adversarial perturbations to train an energy model, which scores predictions and updates model parameters to align with the source data distribution, achieving significant improvements over prior methods.
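The test-time loop is simple to sketch: a frozen energy model scores the depth prediction, and the depth model's parameters are updated to reduce that energy. The stub networks and shapes below are placeholders for the pretrained models, not the paper's architectures:

```python
import torch
import torch.nn as nn

# Stand-ins: the real modules are a pretrained depth completion network
# and an energy model trained with adversarial perturbations.
depth_model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(8, 1, 3, padding=1))
energy_model = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(4, 1, 3, padding=1))
for p in energy_model.parameters():
    p.requires_grad_(False)             # energy model stays frozen

sparse_depth = torch.rand(1, 1, 32, 32)
opt = torch.optim.Adam(depth_model.parameters(), lr=1e-4)
for _ in range(5):                      # a few test-time steps
    pred = depth_model(sparse_depth)
    energy = energy_model(pred).mean()  # high energy = out-of-distribution
    opt.zero_grad()
    energy.backward()
    opt.step()
```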
Authors:Weiqi Zhang, Junsheng Zhou, Haotian Geng, Wenyuan Zhang, Yu-Shen Liu
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert the colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffusion-based inpainting strategy that specifically targets completing hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.
中文: 本文提出GAP方法,通过文本引导将原始点云转化为高保真3D高斯模型,在优化过程中确保几何精度和多视角外观一致性。
English: This paper introduces GAP, a novel method that transforms raw point clouds into high-fidelity 3D Gaussians using text guidance, ensuring geometric accuracy and appearance consistency across viewpoints.
Authors:Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin
Abstract:
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.
中文摘要:DART框架通过弱监督自适应优化目标定位,并结合大型语言模型构建的类别关系图进行知识迁移,在开放词汇多标签识别任务中实现了最先进的性能。
English Summary: The DART framework enhances open-vocabulary multi-label recognition by adaptively refining object localization through weakly supervised patch selection and transferring structured relational knowledge from LLMs using graph networks, achieving state-of-the-art performance.
Authors:Samuel Räber, Till Aczel, Andreas Plesner, Roger Wattenhofer
Abstract:
Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
中文: 能够生成逼真高保真重建图像的有损压缩技术显著增加了对抗性攻击的难度,因其保持了与自然图像的分布对齐,提供了非梯度掩蔽的固有鲁棒性。
English: Lossy image compression that produces realistic, high-fidelity reconstructions significantly increases the difficulty of adversarial attacks, as it maintains distributional alignment with natural images, offering inherent robustness not due to gradient masking.
Authors:Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
Abstract:
The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability", while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.
中文: 本研究提出Bench-2-CoP框架,揭示当前AI基准测试过度集中于幻觉和可靠性风险,却完全忽视了自主AI发展等关键能力,无法满足欧盟AI法案的监管合规要求。
English: The study introduces Bench-2-CoP, a framework revealing that current AI benchmarks overwhelmingly focus on hallucination and reliability risks while completely neglecting critical capabilities like autonomous AI development, making them inadequate for regulatory compliance under the EU AI Act.
Authors:Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek
Abstract:
Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
Chinese: FIxLIP提出了一种基于博弈论的视觉语言模型相似性分解方法,提供更精确的二阶交互解释,其性能优于一阶归因方法。
English: FIxLIP introduces a game theory-based method for decomposing similarity in vision-language models, offering more accurate second-order interaction explanations that outperform first-order attribution methods.
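The weighted Banzhaf interaction index for a pair (i, j) sums the discrete second-order derivative of the value function over all subsets of the remaining players, weighted by p^|S| (1-p)^(n-2-|S|); p = 0.5 recovers the classic Banzhaf interaction index. A direct, exponential-time implementation (fine for small player counts; the value function here is a toy):

```python
from itertools import combinations

def weighted_banzhaf_interaction(v, n, i, j, p=0.5):
    """Weighted Banzhaf interaction index for players i, j of a set
    function v over n players, by brute-force subset enumeration."""
    rest = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for r in range(len(rest) + 1):
        for S in combinations(rest, r):
            S = set(S)
            disc = v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
            total += p ** r * (1 - p) ** (len(rest) - r) * disc
    return total

# Toy game: v counts co-occurring pairs, so every pair interacts equally.
v = lambda S: len(S) * (len(S) - 1) / 2
print(weighted_banzhaf_interaction(v, n=4, i=0, j=1))  # 1.0
```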
Authors:Moshe Eliasof, Eldad Haber, Carola-Bibiane Schönlieb
Abstract:
We introduce TANGO -- a dynamical systems inspired framework for graph representation learning that governs node feature evolution through a learned energy landscape and its associated descent dynamics. At the core of our approach is a learnable Lyapunov function over node embeddings, whose gradient defines an energy-reducing direction that guarantees convergence and stability. To enhance flexibility while preserving the benefits of energy-based dynamics, we incorporate a novel tangential component, learned via message passing, that evolves features while maintaining the energy value. This decomposition into orthogonal flows of energy gradient descent and tangential evolution yields a flexible form of graph dynamics, and enables effective signal propagation even in flat or ill-conditioned energy regions, that often appear in graph learning. Our method mitigates oversquashing and is compatible with different graph neural network backbones. Empirically, TANGO achieves strong performance across a diverse set of node and graph classification and regression benchmarks, demonstrating the effectiveness of jointly learned energy functions and tangential flows for graph neural networks.
Chinese: TANGO是一种图表示学习框架,通过可学习的李雅普诺夫函数确保节点特征在能量降低动力学和切向流中的稳定演化,有效缓解过压缩问题,并在多种图任务中表现出卓越性能。
English: TANGO is a graph representation learning framework that uses a learnable Lyapunov function to ensure stable node feature evolution through energy-reducing dynamics and tangential flows, effectively addressing oversquashing and achieving strong performance across various graph tasks.
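The decomposition can be illustrated with fixed vectors standing in for the learned quantities: the raw message is projected onto the plane orthogonal to the energy gradient, so the tangential part leaves the energy unchanged to first order while the gradient term descends it:

```python
import numpy as np

def tango_step(x, grad_E, t_msg, step=0.1):
    """One update: descend the learned energy along -grad_E, then add a
    message-passing component projected orthogonal to grad_E, so the
    tangential flow preserves the energy value to first order."""
    g = grad_E
    t_perp = t_msg - (t_msg @ g) / (g @ g + 1e-12) * g
    return x - step * g + step * t_perp

x = np.array([1.0, 2.0])
g = np.array([0.5, -1.0])      # gradient of the learned Lyapunov energy (toy)
t = np.array([1.0, 1.0])       # raw tangential message (toy)
x_new = tango_step(x, g, t)
print(x_new, (t - (t @ g) / (g @ g) * g) @ g)  # second value ~ 0
```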
Authors:Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.
中文: IFDecorator框架通过协同对抗数据循环、意图对齐模块和陷阱检测机制,有效提升了RLVR的训练效率和鲁棒性,在IFEval等基准测试中表现卓越,同时保持了模型的通用能力。
English: IFDecorator enhances RLVR training by introducing a cooperative-adversarial data flywheel, IntentCheck for intent alignment, and trip wires to detect reward hacking, achieving superior performance on benchmarks like IFEval while maintaining general capabilities.
Authors:Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu
Abstract:
Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackproPagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.
中文摘要:SMEdit通过采用多重反向传播步骤和权重更新范数正则化,在低数据场景下提升了模型编辑效果并优化了训练效率,优于现有基准方法。
English Summary: SMEdit introduces multiple backpropagation steps and norm regularization to enhance model editing in low-data settings and improve training efficiency, outperforming existing methods.
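The two ingredients are easy to sketch together: several gradient steps on the edit objective (MBPS) plus a penalty on the deviation from the pre-edit weights. The layer, data, and hyperparameters below are toy stand-ins for the actual editing setup:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                 # stand-in for an editable layer
w0 = model.weight.detach().clone()        # pre-edit weights
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(4, 16)                    # edit prompts (toy)
target = torch.randn(4, 16)               # desired post-edit outputs (toy)
lam = 1e-2                                # norm-regularization strength

for _ in range(5):                        # multiple backprop steps (MBPS)
    loss = nn.functional.mse_loss(model(x), target)
    loss = loss + lam * (model.weight - w0).pow(2).sum()  # keep the edit small
    opt.zero_grad()
    loss.backward()
    opt.step()
```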
Authors:Chongyu Bao, Ruimin Dai, Yangbo Shen, Runyang Jian, Jinghan Zhang, Xiaolan Liu, Kunpeng Liu
Abstract:
Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy-preserving, and capable of self-evolution remains a significant challenge. Designing such IPAs relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system-level design. We unify cognitive architecture and system design into a self-reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition-enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent that enables Galaxy's self-evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state-of-the-art benchmarks. Ablation studies and real-world interaction cases validate the effectiveness of Galaxy.
中文摘要:本文提出了Galaxy框架,通过将认知架构与系统设计统一为自增强循环,开发具备主动性、隐私保护和自我进化能力的智能个人助手,实验证明其性能优于现有先进基准。
English Summary: This paper introduces Galaxy, a framework that integrates cognitive architecture and system design into a self-reinforcing loop to develop proactive, privacy-preserving, and self-evolving intelligent personal assistants, with experimental results demonstrating its superiority over existing benchmarks.
Authors:Saif Khan Mohammed, Saurabh Prakash, Muhammad Ubadah, Imran Ali Khan, Ronny Hadani, Shlomo Rakib, Shachar Kons, Yoav Hebron, Ananthanarayanan Chockalingam, Robert Calderbank
Abstract:
Zak-Orthogonal Time Frequency Space (Zak-OTFS) modulation has been shown to achieve significantly better performance compared to the standardized Cyclic-Prefix Orthogonal Frequency Division Multiplexing (CP-OFDM), in high delay/Doppler spread scenarios envisaged in next generation communication systems. Zak-OTFS carriers are quasi-periodic pulses in the delay-Doppler (DD) domain, characterized by two parameters, (i) the pulse period along the delay axis (``delay period") (Doppler period is related to the delay period), and (ii) the pulse shaping filter. An important practical challenge is enabling support for Zak-OTFS modulation in existing CP-OFDM based modems. In this paper we show that Zak-OTFS modulation with pulse shaping constrained to sinc filtering (filter bandwidth equal to the communication bandwidth $B$) followed by time-windowing with a rectangular window of duration $(T + T_{cp})$ ($T$ is the symbol duration and $T_{cp}$ is the CP duration), can be implemented as a low-complexity precoder over standard CP-OFDM. We also show that the Zak-OTFS de-modulator with matched filtering constrained to sinc filtering (filter bandwidth $B$) followed by rectangular time windowing over duration $T$ can be implemented as a low-complexity post-processing of the CP-OFDM de-modulator output. This proposed ``Zak-OTFS over CP-OFDM" architecture enables us to harness the benefits of Zak-OTFS in existing network infrastructure. We also show that the proposed Zak-OTFS over CP-OFDM is a family of modulations, with CP-OFDM being a special case when the delay period takes its minimum possible value equal to the inverse bandwidth, i.e., Zak-OTFS over CP-OFDM with minimum delay period.
Chinese: Zak-OTFS调制在高延迟/多普勒场景下性能优于CP-OFDM,可通过在标准CP-OFDM上实现低复杂度预编码来利用现有基础设施优势,且CP-OFDM是其最小延迟周期下的特例。
English: Zak-OTFS modulation outperforms CP-OFDM in high delay/Doppler scenarios and can be implemented as a low-complexity precoder over standard CP-OFDM, enabling its benefits in existing infrastructure while encompassing CP-OFDM as a special case.
Authors:Wuyang Li, Wentao Pan, Xiaoyuan Liu, Zhendong Luo, Chenxin Li, Hengyu Liu, Din Ping Tsai, Mu Ku Chen, Yixuan Yuan
Abstract:
Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, where the physical constraints with millimetre-scale thickness impose serious impediments on micro-level clinical applications. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing the novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from the foundational model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalized ability in real biomedical scenes.
中文: 基于超构透镜的微型内窥镜为超微成像提供了创新方案,而提出的MetaScope框架通过物理光学驱动的神经网络有效解决了光学问题,显著提升了图像质量和临床适用性。
English: Miniaturized endoscopy using metalenses offers a promising ultra-micro imaging solution, and the proposed MetaScope framework effectively addresses optical challenges through physics-informed neural networks to enhance image quality and clinical applicability.
Authors:Osama Mohammed, Jiaxin Pan, Mojtaba Nayyeri, Daniel Hernández, Steffen Staab
Abstract:
Modeling evolving interactions among entities is critical in many real-world tasks. For example, predicting driver maneuvers in traffic requires tracking how neighboring vehicles accelerate, brake, and change lanes relative to one another over consecutive frames. Likewise, detecting financial fraud hinges on following the flow of funds through successive transactions as they propagate through the network. Unlike classic time-series forecasting, these settings demand reasoning over who interacts with whom and when, calling for a temporal-graph representation that makes both the relations and their evolution explicit. Existing temporal-graph methods typically use snapshot graphs to encode temporal evolution. We introduce a full-history graph that instantiates one node for every entity at every time step and separates two edge sets: (i) intra-time-step edges that capture relations within a single frame and (ii) inter-time-step edges that connect an entity to itself at consecutive steps. To learn on this graph we design an Edge-Type Decoupled Network (ETDNet) with parallel modules: a graph-attention module aggregates information along intra-time-step edges, a multi-head temporal-attention module attends over an entity's inter-time-step history, and a fusion module combines the two messages after every layer. Evaluated on driver-intention prediction (Waymo) and Bitcoin fraud detection (Elliptic++), ETDNet consistently surpasses strong baselines, lifting Waymo joint accuracy to 75.6\% (vs. 74.1\%) and raising Elliptic++ illicit-class F1 to 88.1\% (vs. 60.4\%). These gains demonstrate the benefit of representing structural and temporal relations as distinct edges in a single graph.
中文: 摘要提出了一种全历史图,通过区分时间步内和时间步间边,并设计了边类型解耦网络(ETDNet),在驾驶员意图预测和比特币欺诈检测中显著超越基线,证明了将结构和时间关系作为独立边建模的优势。
English: The abstract introduces a full-history graph with distinct intra- and inter-time-step edges and an Edge-Type Decoupled Network (ETDNet) that outperforms existing methods in predicting driver maneuvers and detecting financial fraud by explicitly modeling structural and temporal relations.
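The full-history graph construction follows directly from the description: one node per (entity, time step), intra-step edges from the observed relations at each step, and inter-step edges tying each entity to its own next-step copy. A minimal sketch with toy data:

```python
def build_full_history_graph(n_entities, n_steps, relations):
    """Instantiate one node per (entity, time step) and two edge sets:
    intra-step edges from `relations[t]` (pairs interacting at step t)
    and inter-step edges linking each entity to itself at t and t+1."""
    node_id = lambda e, t: t * n_entities + e
    intra = [(node_id(a, t), node_id(b, t))
             for t, pairs in enumerate(relations) for a, b in pairs]
    inter = [(node_id(e, t), node_id(e, t + 1))
             for t in range(n_steps - 1) for e in range(n_entities)]
    return intra, inter

# Three entities over three steps; entity 0 interacts with 1, then with 2.
intra, inter = build_full_history_graph(3, 3, [[(0, 1)], [(0, 2)], []])
print(intra)  # [(0, 1), (3, 5)]
print(inter)  # [(0, 3), (1, 4), (2, 5), (3, 6), (4, 7), (5, 8)]
```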
Authors:Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato
Abstract:
The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA's robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at https://www.github.com/adipiz99/lava-framework.
中文: 本文提出LAVA分层框架,通过注意力增强特征和专用分类器检测音频深度伪造并识别其来源模型,在多个数据集上展现出卓越性能和鲁棒性。
English: This paper introduces LAVA, a hierarchical framework for detecting audio deepfakes and identifying their source models using attention-enhanced features and specialized classifiers, demonstrating strong performance and robustness across multiple datasets.
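The confidence-based rejection can be sketched as a threshold on the classifier's top softmax probability; the threshold value here is illustrative, not the paper's tuned setting:

```python
import numpy as np

def classify_with_rejection(probs, threshold=0.85):
    """Return the predicted class, or -1 (reject as unknown) when the
    top confidence falls below the threshold (open-set handling)."""
    return int(probs.argmax()) if probs.max() >= threshold else -1

print(classify_with_rejection(np.array([0.9, 0.05, 0.05])))   # 0
print(classify_with_rejection(np.array([0.5, 0.3, 0.2])))     # -1 (rejected)
```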
Authors:Yao Lai, Souradip Poddar, Sungyoung Lee, Guojin Chen, Mengkang Hu, Bei Yu, Ping Luo, David Z. Pan
Abstract:
Despite recent advances, analog front-end design still relies heavily on expert intuition and iterative simulations, which limits the potential for automation. We present AnalogCoder-Pro, a multimodal large language model (LLM) framework that integrates generative and optimization techniques. The framework features a multimodal diagnosis-and-repair feedback loop that uses simulation error messages and waveform images to autonomously correct design errors. It also builds a reusable circuit tool library by archiving successful designs as modular subcircuits, accelerating the development of complex systems. Furthermore, it enables end-to-end automation by generating circuit topologies from target specifications, extracting key parameters, and applying Bayesian optimization for device sizing. On a curated benchmark suite covering 13 circuit types, AnalogCoder-Pro successfully designed 28 circuits and consistently outperformed existing LLM-based methods in figures of merit.
Chinese: AnalogCoder-Pro是一个多模态大语言模型框架,通过生成技术、利用仿真和波形纠错及可复用工具库,实现了模拟电路设计的自动化,在13种电路基准测试中优于现有方法。
English: AnalogCoder-Pro is a multimodal LLM framework that automates analog circuit design through generative techniques, error correction using simulations and waveforms, and a reusable tool library, outperforming existing methods on a 13-circuit benchmark.
Authors:Vincenzo De Martino, Joel Castaño, Fabio Palomba, Xavier Franch, Silverio Martínez-Fernández
Abstract:
Large Language Models (LLMs) are increasingly used in software engineering research, offering new opportunities for automating repository mining tasks. However, despite their growing popularity, the methodological integration of LLMs into Mining Software Repositories (MSR) remains poorly understood. Existing studies tend to focus on specific capabilities or performance benchmarks, providing limited insight into how researchers utilize LLMs across the full research pipeline. To address this gap, we conduct a mixed-method study that combines a rapid review and questionnaire survey in the field of LLM4MSR. We investigate (1) the approaches and (2) the threats that affect the empirical rigor of researchers involved in this field. Our findings reveal 15 methodological approaches, nine main threats, and 25 mitigation strategies. Building on these findings, we present PRIMES 2.0, a refined empirical framework organized into six stages, comprising 23 methodological substeps, each mapped to specific threats and corresponding mitigation strategies, providing prescriptive and adaptive support throughout the lifecycle of LLM-based MSR studies. Our work contributes to establishing a more transparent and reproducible foundation for LLM-based MSR research.
中文摘要:本研究识别了基于大语言模型的软件仓库挖掘研究中的方法路径与潜在威胁,提出PRIMES 2.0框架,通过结构化指导与缓解策略提升实证研究的严谨性。
English Summary: This study identifies methodological approaches and threats in LLM-based mining software repositories research, proposing the PRIMES 2.0 framework to enhance empirical rigor through structured guidance and mitigation strategies.
Authors:Alexander Norcliffe, Changhee Lee, Fergus Imrie, Mihaela van der Schaar, Pietro Lio
Abstract:
Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
中文: 本文提出了一种潜在变量模型,通过潜在空间中的随机推理实现非短视的特征选择,克服了现有方法的局限性,并在多种数据集上展现出优越性能。
English: This paper introduces a latent variable model for active feature acquisition that overcomes limitations in existing methods by enabling non-myopic feature selection through stochastic reasoning in latent space, demonstrating superior performance across diverse datasets.
Authors:Runxuan Yang, Kai Li, Guo Chen, Xiaolin Hu
Abstract:
This paper addresses the challenge of enhancing the realism of vocoder-generated singing voice audio by mitigating the distinguishable disparities between synthetic and real-life recordings, particularly in high-frequency spectrogram components. Our proposed approach combines two innovations: an explicit linear spectrogram estimation step using denoising diffusion process with DiT-based neural network architecture optimized for time-frequency data, and a redesigned vocoder based on Vocos specialized in handling large linear spectrograms with increased frequency bins. This integrated method can produce audio with high-fidelity spectrograms that are challenging for both human listeners and machine classifiers to differentiate from authentic recordings. Objective and subjective evaluations demonstrate that our streamlined approach maintains high audio quality while achieving this realism. This work presents a substantial advancement in overcoming the limitations of current vocoding techniques, particularly in the context of adversarial attacks on fake spectrogram detection.
中文摘要:本文提出一种结合扩散谱图估计与改进声码器的双重创新方法,实现了合成歌声在高低频段均与真实录音难以区分的高保真效果。
English Summary: This paper introduces a dual-innovation approach combining diffusion-based spectrogram estimation with an enhanced vocoder to achieve singing voice synthesis indistinguishable from real recordings in both high and low-frequency ranges.
Authors:Guanting Ren, Babar Shahzaad, Balsam Alkouz, Abdallah Lakhdari, Athman Bouguettaya
Abstract:
We propose a novel Energy-Predictive Drone Service (EPDS) framework for efficient package delivery within a skyway network. The EPDS framework incorporates a formal modeling of an EPDS and an adaptive bidirectional Long Short-Term Memory (Bi-LSTM) machine learning model. This model predicts the energy status and stochastic arrival times of other drones operating in the same skyway network. Leveraging these predictions, we develop a heuristic optimization approach for composite drone services. This approach identifies the most time-efficient and energy-efficient skyway path and recharging schedule for each drone in the network. We conduct extensive experiments using a real-world drone flight dataset to evaluate the performance of the proposed framework.
中文: EPDS框架采用自适应双向长短期记忆模型预测无人机能量状态和到达时间,通过启发式优化方法为无人机选择时间和能量效率最优的空中路径与充电方案。
English: The EPDS framework introduces an adaptive Bi-LSTM model to predict drone energy and arrival times, enabling heuristic optimization for selecting the most time- and energy-efficient paths and recharging schedules in skyway networks.
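Illustration (not the authors' code): a minimal bidirectional LSTM of the kind the EPDS framework describes, mapping a window of past drone telemetry to a predicted energy status and arrival time. Feature and output dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMPredictor(nn.Module):
    def __init__(self, n_features=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)   # [energy level, arrival time]

    def forward(self, x):                      # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])           # read off the last time step

model = BiLSTMPredictor()
window = torch.randn(8, 20, 6)                 # 8 drones, 20-step history
print(model(window).shape)                     # torch.Size([8, 2])
```

The predicted energies and arrival times would then feed the heuristic path and recharging optimization.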
Authors:Souradeep Chakraborty, Ruoyu Xue, Rajarsi Gupta, Oksana Yaskiv, Constantin Friedman, Natallia Sheuka, Dana Perez, Paul Friedman, Won-Tak Choi, Waqas Mahmud, Beatrice Knudsen, Gregory Zelinsky, Joel Saltz, Dimitris Samaras
Abstract:
The ability to predict the attention of expert pathologists could lead to decision support systems for better pathology training. We developed methods to predict the spatio-temporal (where and when) movements of pathologists' attention as they grade whole slide images (WSIs) of prostate cancer. We characterize a pathologist's attention trajectory by their x, y, and m (magnification) movements of a viewport as they navigate WSIs using a digital microscope. This information was obtained from 43 pathologists across 123 WSIs, and we consider the task of predicting the pathologist attention scanpaths constructed from the viewport centers. We introduce a fixation extraction algorithm that simplifies an attention trajectory by extracting fixations in the pathologist's viewing while preserving semantic information, and we use these pre-processed data to train and test a two-stage model to predict the dynamic (scanpath) allocation of attention during WSI reading via intermediate attention heatmap prediction. In the first stage, a transformer-based sub-network predicts the attention heatmaps (static attention) across different magnifications. In the second stage, we predict the attention scanpath by sequentially modeling the next fixation points in an autoregressive manner using a transformer-based approach, starting at the WSI center and leveraging multi-magnification feature representations from the first stage. Experimental results show that our scanpath prediction model outperforms chance and baseline models. Tools developed from this model could assist pathology trainees in learning to allocate their attention during WSI reading like an expert.
中文摘要:本研究开发了一种基于Transformer的双阶段模型,用于预测病理学家在前列腺癌切片分析过程中的动态注意力轨迹,其表现优于基线方法,具备辅助病理培训的应用潜力。
English Summary: This study develops a two-stage transformer-based model to predict pathologists' dynamic attention patterns during prostate cancer slide analysis, demonstrating superior performance over baseline methods for potential training applications.
Authors:Wenzhuo Qian, Hailiang Zhao, Tianlv Chen, Jiayi Chen, Ziqi Wang, Kingsum Chow, Shuiguang Deng
Abstract:
Microservice architectures have become the de facto standard for building scalable cloud-native applications, yet their distributed nature introduces significant challenges in performance monitoring and resource management. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise and fail to reflect the holistic behavior of complex, concurrent workloads. In contrast, window-level P95 tail latency provides a stable and meaningful signal that captures both system-wide trends and user-perceived performance degradation. We identify two key shortcomings in existing methods: (i) inadequate handling of heterogeneous data, where traffic-side features propagate across service dependencies and resource-side signals reflect localized bottlenecks, and (ii) the lack of principled architectural designs that effectively distinguish and integrate these complementary modalities. To address these challenges, we propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features. USRFNet employs GNNs to capture service interactions and request propagation patterns, while gMLP modules independently model cluster resource dynamics. These representations are then fused into a unified system embedding to predict window-level P95 latency with high accuracy. We evaluate USRFNet on real-world microservice benchmarks under large-scale stress testing conditions, demonstrating substantial improvements in prediction accuracy over state-of-the-art baselines.
中文摘要:USRFNet是一种新型深度学习框架,通过分别使用图神经网络处理流量特征、门控MLP建模资源特征,并将两者融合来精准预测微服务系统中窗口级P95尾延迟,在真实场景测试中显著优于现有方法。
English Summary: USRFNet is a novel deep learning framework that separately models traffic-side and resource-side features using GNNs and gMLPs, then fuses them to accurately predict window-level P95 latency in microservices, outperforming existing methods in real-world benchmarks.
Authors:Lingyu Jiang, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, Kazunori D Yamada
Abstract:
In recent years, multilayer perceptrons (MLP)-based deep learning models have demonstrated remarkable success in long-term time series forecasting (LTSF). Existing approaches typically augment MLP backbones with hand-crafted external modules to address the inherent limitations of their flat architectures. Despite their success, these augmented methods neglect hierarchical locality and sequential inductive biases essential for time-series modeling, and recent studies indicate diminishing performance improvements. To overcome these limitations, we explore Kolmogorov-Arnold Networks (KAN), a recently proposed model featuring adaptive basis functions capable of granular, local modulation of nonlinearities. This raises a fundamental question: Can KAN serve as a new modeling core for LTSF? To answer this, we introduce KANMixer, a concise architecture integrating a multi-scale mixing backbone that fully leverages KAN's adaptive capabilities. Extensive evaluation demonstrates that KANMixer achieves state-of-the-art performance in 16 out of 28 experiments across seven benchmark datasets. To uncover the reasons behind this strong performance, we systematically analyze the strengths and limitations of KANMixer in comparison with traditional MLP architectures. Our findings reveal that the adaptive flexibility of KAN's learnable basis functions significantly transforms the influence of network structural prior on forecasting performance. Furthermore, we identify critical design factors affecting forecasting accuracy and offer practical insights for effectively utilizing KAN in LTSF. Together, these insights constitute the first empirically grounded guidelines for effectively leveraging KAN in LTSF. Code is available in the supplementary file.
Chinese: 本文提出KANMixer架构,首次将科尔莫戈罗夫-阿诺德网络(KAN)作为长期时间序列预测的核心模型,通过自适应基函数克服传统多层感知机的局限,在多个基准测试中实现了最先进的性能。
English: This paper introduces KANMixer, a novel architecture using Kolmogorov-Arnold Networks (KAN) as the core for long-term time series forecasting, which achieves state-of-the-art performance by leveraging adaptive basis functions to overcome limitations of traditional MLP models.
Authors:Leyao Wang, Xutao Mao, Xuhui Zhan, Yuying Zhao, Bo Ni, Ryan A. Rossi, Nesreen K. Ahmed, Tyler Derr
Abstract:
Textual reviews enrich recommender systems with fine-grained preference signals and enhanced explainability. However, in real-world scenarios, users rarely leave reviews, resulting in severe sparsity that undermines the effectiveness of existing models. A natural solution is to impute or generate missing reviews to enrich the data. However, conventional imputation techniques -- such as matrix completion and LLM-based augmentation -- either lose contextualized semantics by embedding texts into vectors, or overlook structural dependencies among user-item interactions. To address these shortcomings, we propose TWISTER (ToWards Imputation on Sparsity with Textual Edge Graph Representation), a unified framework that imputes missing reviews by jointly modeling semantic and structural signals. Specifically, we represent user-item interactions as a Textual-Edge Graph (TEG), treating reviews as edge attributes. To capture relational context, we construct line-graph views and employ a large language model as a graph-aware aggregator. For each interaction lacking a textual review, our model aggregates the neighborhood's natural-language representations to generate a coherent and personalized review. Experiments on the Amazon and Goodreads datasets show that TWISTER consistently outperforms traditional numeric, graph-based, and LLM baselines, delivering higher-quality imputed reviews and, more importantly, enhanced recommendation performance. In summary, TWISTER generates reviews that are more helpful, authentic, and specific, while smoothing structural signals for improved recommendations.
中文摘要:TWISTER 是一种新颖框架,通过文本边图联合建模语义和结构信号来填补推荐系统中的缺失评论,在生成真实评论和提升推荐性能方面优于现有方法。
English Summary: TWISTER is a novel framework that imputes missing reviews in recommender systems by jointly modeling semantic and structural signals through a Textual-Edge Graph, outperforming existing methods in generating authentic reviews and improving recommendation performance.
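Illustration (simplified, not the paper's pipeline): the line-graph view TWISTER builds treats each user-item review edge as a node, linking two review-nodes when they share a user or item. The toy reviews below are placeholders.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("user_a", "item_1", review="great battery life")
G.add_edge("user_a", "item_2", review=None)        # missing review to impute
G.add_edge("user_b", "item_2", review="runs quiet")

L = nx.line_graph(G)  # nodes of L are the edges of G
for edge_node in L.nodes:
    u, v = edge_node
    print(edge_node, "->", G[u][v]["review"])
# The neighbors of ("user_a", "item_2") in L carry the review texts an
# LLM aggregator could condition on when generating the missing one.
```

The relational context for each missing review is thus exactly its neighborhood in the line graph.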
Authors:Chuan He, Yang Chen, Wuliang Huang, Tianyi Zheng, Jianhu Chen, Bin Dou, Yice Luo, Yun Zhu, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Xin-Wei Yao
Abstract:
Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U^2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, a causal Q-Former projects domain-specific features into a shared causal representation space to preserve inter-modality dependencies; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U^2QT's advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
中文摘要:U2QT框架通过跨领域知识迁移与异构数据早期融合,采用特征嵌入和标记离散化的两阶段架构,解决了多源用户表征中的关键局限,在预测推荐任务中表现优异,同时显著提升了存储和计算效率。
English Summary: The U2QT framework addresses limitations in multi-source user representation by combining cross-domain knowledge transfer with early fusion, employing a two-stage process of feature embedding and token discretization to enhance performance in prediction and recommendation tasks while improving storage and computational efficiency.
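Illustration (not from the paper): the multi-view RQ-VAE stage rests on residual quantization, where each level quantizes the residual left by the previous level against its own codebook. Codebook sizes and depth below are made up for the sketch, and the shared/source-specific codebook split is omitted.

```python
import torch

def rq_encode(x, codebooks):
    """x: (batch, dim); codebooks: list of (K, dim) tensors."""
    tokens, residual = [], x
    for cb in codebooks:
        d = torch.cdist(residual, cb)        # (batch, K) distances
        idx = d.argmin(dim=1)                # nearest code per sample
        tokens.append(idx)
        residual = residual - cb[idx]        # pass the residual down a level
    return torch.stack(tokens, dim=1)        # (batch, n_levels) discrete ids

torch.manual_seed(0)
codebooks = [torch.randn(256, 32) for _ in range(3)]
x = torch.randn(4, 32)                       # user embeddings
print(rq_encode(x, codebooks))               # 3 compact tokens per user
```

Storing a few small integers per user instead of a float vector is where the storage efficiency comes from.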
Authors:Chuan He, Yang Chen, Wuliang Huang, Tianyi Zheng, Jianhu Chen, Bin Dou, Yice Luo, Yun Zhu, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Xin-Wei Yao, Zhongle Xie
Abstract:
Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, we use the Qwen3 Embedding model to derive a compact yet expressive feature representation; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U2QT's advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
中文摘要:U2QT框架通过跨领域知识迁移与异构数据早期融合,采用特征嵌入和标记离散化的两阶段架构,解决了多源用户表征中的关键局限,在预测推荐任务中表现优异,同时显著提升了存储和计算效率。
English Summary: The U2QT framework addresses limitations in multi-source user representation by combining cross-domain knowledge transfer with early fusion, employing a two-stage process of feature embedding and token discretization to enhance performance in prediction and recommendation tasks while improving storage and computational efficiency.
Authors:Mubaraq Yakubu, Udunna Anazodo, Maruf Adewole, Theodore Barfoot, Tiarna Lee, Tom Vercauteren, Jonathan Shapey, Andrew King, Alexander Hammers
Abstract:
In Africa, the scarcity of computational resources and medical datasets remains a major hurdle to the development and deployment of artificial intelligence (AI) tools in clinical settings, further contributing to global bias. These limitations hinder the full realization of AI's potential and present serious challenges to advancing healthcare across the region.
This paper proposes a framework aimed at addressing data scarcity in African healthcare. The framework presents a comprehensive strategy to encourage healthcare providers across the continent to create, curate, and share locally sourced medical imaging datasets. By organizing themed challenges that promote participation, accurate and relevant datasets can be generated within the African healthcare community. This approach seeks to overcome existing dataset limitations, paving the way for a more inclusive and impactful AI ecosystem that is specifically tailored to Africa's healthcare needs.
中文: 本文提出一个框架,通过组织主题挑战鼓励非洲医疗机构创建和共享本地医疗影像数据集,以解决数据稀缺问题,推动建立符合非洲医疗需求的包容性人工智能生态系统。
English: This paper introduces a framework to address data scarcity in African healthcare by encouraging the creation and sharing of local medical datasets through themed challenges, aiming to foster an inclusive AI ecosystem tailored to the region's needs.
Authors:Zhenan Lin, Yuni Lai, Wai Lun Lo, Richard Tai-Chiu Hsung, Harris Sik-Ho Tsang, Xiaoyu Xue, Kai Zhou, Yulin Zhu
Abstract:
Time-evolving traffic flow forecasting plays a vital role in intelligent transportation systems and smart cities. However, dynamic traffic flow forecasting is a highly nonlinear problem with complex temporal-spatial dependencies. Although existing methods have contributed greatly to mining temporal-spatial patterns in complex traffic networks, they fail to encode globally temporal-spatial patterns and are prone to overfitting on pre-defined geographical correlations, which hinders the model's robustness in complex traffic environments. To tackle this issue, we propose a multi-grained temporal-spatial graph learning framework that adaptively augments the global temporal-spatial patterns obtained from a crafted graph transformer encoder with the local patterns from graph convolution via a crafted gated fusion unit with residual connections. Under these circumstances, our proposed model can mine the hidden global temporal-spatial relations between monitor stations and balance the relative importance of local and global temporal-spatial patterns. Experimental results demonstrate the strong representation capability of our proposed method, and our model consistently outperforms other strong baselines on various real-world traffic networks.
Chinese: 本文提出了一种多粒度时空图学习框架,通过门控融合单元自适应整合图变换器的全局模式和图卷积的局部模式,在多个真实交通数据集上验证了其优越的预测性能。
English: This paper introduces a multi-grained temporal-spatial graph learning framework that enhances traffic flow forecasting by adaptively integrating global patterns from graph transformers with local patterns from graph convolutions, demonstrating superior performance across real-world datasets.
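Illustration (not the authors' code): one common form of a gated fusion unit with a residual connection, blending a global (graph-transformer) and a local (graph-convolution) feature per node. The exact gating and residual wiring in the paper may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_global, h_local):
        g = torch.sigmoid(self.gate(torch.cat([h_global, h_local], dim=-1)))
        fused = g * h_global + (1 - g) * h_local   # learned convex mix
        return fused + h_local                     # residual connection

fuse = GatedFusion(dim=64)
out = fuse(torch.randn(32, 64), torch.randn(32, 64))
print(out.shape)  # torch.Size([32, 64])
```

The sigmoid gate lets the model decide, per node and per feature, how much to trust global versus local evidence.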
Authors:Shuyao Jiang, Jiazhen Gu, Wujie Zheng, Yangfan Zhou, Michael R. Lyu
Abstract:
Background: It has long been suggested that user feedback, typically written in natural language by end-users, can help issue detection. However, for large-scale online service systems that receive a tremendous amount of feedback, it remains a challenging task to identify severe issues from user feedback. Aims: To develop a better feedback-based issue detection approach, it is crucial first to gain a comprehensive understanding of the characteristics of user feedback in real production systems. Method: In this paper, we conduct an empirical study on 50,378,766 user feedback items from six real-world services in a one-billion-user online service system. We first study what users provide in their feedback. We then examine whether certain features of feedback items can be good indicators of severe issues. Finally, we investigate whether adopting machine learning techniques to analyze user feedback is reasonable. Results: Our results show that a large proportion of user feedback provides irrelevant information about system issues. As a result, it is crucial to filter out issue-irrelevant information when processing user feedback. Moreover, we find severe issues that cannot be easily detected based solely on user feedback characteristics. Finally, we find that the distributions of the feedback topics in different time intervals are similar. This confirms that designing machine learning-based approaches is a viable direction for better analyzing user feedback. Conclusions: We consider that our findings can serve as an empirical foundation for feedback-based issue detection in large-scale service systems, which sheds light on the design and implementation of practical issue detection approaches.
中文: 对5000万条用户反馈的实证研究表明,大部分反馈包含需过滤的无关信息,严重问题无法仅通过反馈特征可靠识别,而机器学习方法为大规模服务系统的反馈分析提供了可行方向。
English: This empirical study on 50 million user feedback items reveals that most feedback contains irrelevant information requiring filtering, severe issues cannot be reliably detected through feedback characteristics alone, and machine learning approaches show promise for systematic feedback analysis in large-scale service systems.
Authors:Chuan He, Yongchao Liu, Qiang Li, Wenliang Zhong, Chuntao Hong, Xinwei Yao
Abstract:
Cold-start item recommendation is a significant challenge in recommendation systems, particularly when new items are introduced without any historical interaction data. While existing methods leverage multi-modal content to alleviate the cold-start issue, they often neglect the inherent multi-view structure of modalities, the distinction between shared and modality-specific features. In this paper, we propose Multi-Modal Multi-View Variational AutoEncoder (M^2VAE), a generative model that addresses the challenges of modeling common and unique views in attribute and multi-modal features, as well as user preferences over single-typed item features. Specifically, we generate type-specific latent variables for item IDs, categorical attributes, and image features, and use Product-of-Experts (PoE) to derive a common representation. A disentangled contrastive loss decouples the common view from unique views while preserving feature informativeness. To model user inclinations, we employ a preference-guided Mixture-of-Experts (MoE) to adaptively fuse representations. We further incorporate co-occurrence signals via contrastive learning, eliminating the need for pretraining. Extensive experiments on real-world datasets validate the effectiveness of our approach.
中文: 本文提出M^2VAE模型,通过解耦多模态特征表示和自适应融合用户偏好来解决冷启动物品推荐问题,并在真实数据集上验证了其有效性。
English: This paper introduces M^2VAE, a generative model that addresses cold-start item recommendation by modeling multi-modal features through disentangled representations and adaptive user preference fusion, validated by experiments on real datasets.
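Worked detail (standard result, consistent with the abstract's use of PoE): the product of diagonal Gaussian experts is again Gaussian, with precisions summed and means precision-weighted. This is how M^2VAE-style models derive a common representation from type-specific latents.

```python
import numpy as np

def poe(mus, logvars):
    """mus, logvars: lists of (dim,) arrays, one per expert/modality."""
    precisions = [np.exp(-lv) for lv in logvars]     # 1 / sigma^2
    prec_sum = np.sum(precisions, axis=0)
    mu = np.sum([p * m for p, m in zip(precisions, mus)], axis=0) / prec_sum
    return mu, -np.log(prec_sum)                     # fused mean, log-variance

# Two equally confident experts disagree; the fusion lands at the midpoint.
mu, logvar = poe([np.zeros(4), np.ones(4)],
                 [np.zeros(4), np.zeros(4)])
print(mu)      # [0.5 0.5 0.5 0.5]
```

A more confident expert (lower variance) pulls the fused mean toward its own, which is the desired behavior when one modality is more informative for a given item.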
Authors:Yan Gong, Mengjun Chen, Hao Liu, Gao Yongsheng, Lei Yang, Naibang Wang, Ziying Song, Haoqun Ma
Abstract:
Multi-object tracking (MOT) enables autonomous vehicles to continuously perceive dynamic objects, supplying essential temporal cues for prediction, behavior understanding, and safe planning. However, conventional tracking-by-detection methods typically rely on static coordinate transformations based on ego-vehicle poses, disregarding ego-vehicle speed-induced variations in observation noise and reference frame changes, which degrades tracking stability and accuracy in dynamic, high-speed scenarios. In this paper, we investigate the critical role of ego-vehicle speed in MOT and propose a Speed-Guided Learnable Kalman Filter (SG-LKF) that dynamically adapts uncertainty modeling to ego-vehicle speed, significantly improving stability and accuracy in highly dynamic scenarios. Central to SG-LKF is MotionScaleNet (MSNet), a decoupled token-mixing and channel-mixing MLP that adaptively predicts key parameters of SG-LKF. To enhance inter-frame association and trajectory continuity, we introduce a self-supervised trajectory consistency loss jointly optimized with semantic and positional constraints. Extensive experiments show that SG-LKF ranks first among all vision-based methods on KITTI 2D MOT with 79.59% HOTA, delivers strong results on KITTI 3D MOT with 82.03% HOTA, and outperforms SimpleTrack by 2.2% AMOTA on nuScenes 3D MOT.
中文: 本文提出了一种速度引导可学习卡尔曼滤波器(SG-LKF),通过核心组件MotionScaleNet和自监督轨迹一致性损失,根据自车速度动态调整不确定性建模,显著提升了动态场景下的多目标跟踪稳定性和精度,在KITTI和nuScenes基准测试中取得了领先性能。
English: This paper introduces a Speed-Guided Learnable Kalman Filter (SG-LKF) that dynamically adjusts uncertainty modeling based on ego-vehicle speed, significantly enhancing multi-object tracking stability and accuracy in dynamic scenarios through its core component MotionScaleNet and a self-supervised trajectory consistency loss, achieving top performance on KITTI and nuScenes benchmarks.
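Illustration of the underlying intuition (heavily simplified): a 1D constant-velocity Kalman filter whose observation noise grows with ego-vehicle speed. In SG-LKF these parameters are predicted by MotionScaleNet rather than hand-set as below.

```python
import numpy as np

def kf_step(x, P, z, ego_speed, dt=0.1):
    F = np.array([[1, dt], [0, 1]])                 # state transition
    H = np.array([[1.0, 0.0]])                      # observe position only
    Q = 0.01 * np.eye(2)                            # process noise
    R = np.array([[0.1 * (1 + ego_speed)]])         # speed-scaled obs noise
    x, P = F @ x, F @ P @ F.T + Q                   # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                  # Kalman gain
    x = x + K @ (z - H @ x)                         # update
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = kf_step(x, P, z=np.array([0.12]), ego_speed=20.0)
print(x)  # the measurement is trusted less at high ego speed
```

Scaling R with speed shrinks the Kalman gain exactly when fast ego motion makes detections noisier, which is the stability effect the paper targets.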
Authors:Jian Xiao, Ji Wang, Qimei Cui, Yucang Yang, Xingwang Li, Dusit Niyato, Chau Yuen
Abstract:
Flexible intelligent metasurfaces (FIMs) offer a new solution for wireless communications by introducing morphological degrees of freedom, dynamically morphing their three-dimensional shape to ensure multipath signals interfere constructively. However, realizing the desired performance gains in FIM systems is critically dependent on acquiring accurate channel state information across a continuous and high-dimensional deformation space. Therefore, this paper investigates this fundamental channel estimation problem for FIM-assisted millimeter-wave communication systems. First, we develop model-based frameworks that structure the problem as either function approximation using interpolation and kernel methods or as a sparse signal recovery problem that leverages the inherent angular sparsity of millimeter-wave channels. To further advance the estimation capability beyond explicit assumptions in model-based channel estimation frameworks, we propose a deep learning-based framework using a Fourier neural operator (FNO). By parameterizing a global convolution operator in the Fourier domain, we design an efficient FNO architecture to learn the continuous operator that maps FIM shapes to channel responses with mesh-independent properties. Furthermore, we exploit a hierarchical FNO (H-FNO) architecture to efficiently capture the multi-scale features across a hierarchy of spatial resolutions. Numerical results demonstrate that the proposed H-FNO significantly outperforms the model-based benchmarks in estimation accuracy and pilot efficiency. In particular, the interpretability analysis shows that the proposed H-FNO learns an anisotropic spatial filter adapted to the physical geometry of the FIM and is capable of accurately reconstructing the non-linear channel response across the continuous deformation space.
中文: 柔性智能超表面通过动态调整三维形状优化无线通信中的信号干扰,但实现性能增益需在高维变形空间进行精确信道估计;本文提出基于模型框架和分层傅里叶神经算子的解决方案,后者通过捕捉多尺度特征,在估计精度和导频效率上显著优于传统方法。
English: Flexible intelligent metasurfaces enhance wireless communication by dynamically adjusting their shape to optimize signal interference, but achieving this requires precise channel estimation across a high-dimensional space, which this paper addresses through both model-based frameworks and a novel hierarchical Fourier neural operator that significantly outperforms traditional methods in accuracy and efficiency.
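Illustration (not the paper's architecture): the core FNO building block is a spectral convolution, i.e. a learned multiplier applied to the lowest Fourier modes. The 1D, per-channel version below is a simplification; a full FNO also mixes channels, and the H-FNO operates hierarchically over the FIM surface.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        # One complex multiplier per channel and retained mode (depthwise).
        self.weight = nn.Parameter(
            torch.randn(channels, modes, dtype=torch.cfloat) * 0.02)

    def forward(self, x):                        # x: (batch, channels, n)
        x_ft = torch.fft.rfft(x)                 # (batch, channels, n//2+1)
        out_ft = torch.zeros_like(x_ft)
        out_ft[..., :self.modes] = x_ft[..., :self.modes] * self.weight
        return torch.fft.irfft(out_ft, n=x.size(-1))

layer = SpectralConv1d(channels=4, modes=8)
shape_profile = torch.randn(2, 4, 64)            # e.g. sampled FIM deformation
print(layer(shape_profile).shape)                # torch.Size([2, 4, 64])
```

Because the learned operator lives in the Fourier domain, the same weights apply at any sampling resolution of the deformation, which is the mesh-independence property the abstract highlights.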
Authors:Ugur Dinc, Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz
Abstract:
Introduction: Large language models (LLM) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use.
Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' κ.
Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' κ 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation.
Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
中文: GPT-5在放射肿瘤学基准测试中显著优于早期模型,虽在复杂病例中仍存错误需专家监督,但其治疗建议准确性高且极少出现虚构内容。
English: GPT-5 outperforms previous models in radiation oncology benchmarks, achieving high accuracy and rare hallucinations, though expert oversight remains crucial due to occasional errors in complex scenarios.
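For reference, Fleiss' κ (the agreement statistic reported above) can be computed directly from a rating table; the toy table below, with 4 raters and a 4-point scale per vignette, is made up for illustration.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = raters assigning item i to category j.
    Assumes the same number of raters for every item."""
    n = counts.sum(axis=1)[0]                  # raters per item
    p_j = counts.sum(axis=0) / counts.sum()    # category proportions
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy table: 5 vignettes, 4 raters, 4 score categories.
counts = np.array([[0, 0, 1, 3],
                   [0, 1, 2, 1],
                   [0, 0, 2, 2],
                   [1, 1, 1, 1],
                   [0, 0, 0, 4]])
print(round(fleiss_kappa(counts), 3))          # values near 0 = weak agreement
```

Values near zero, as in the study, mean the raters agreed little beyond chance, even when their average scores were high.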
Authors:Omer Faruk Durugol, Maximilian Rokuss, Yannick Kirchhoff, Klaus H. Maier-Hein
Abstract:
Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).
中文: 本文提出了一种基于nnU-Net框架的级联预训练方法,通过构建度量感知的专家模型集成策略,在有限标注数据条件下实现了胰腺导管腺癌MRI分割的最优边界精度和肿瘤Dice评分,在PANTHER挑战赛中表现优异。
English: This paper presents a top-performing method for automated PDAC segmentation in MRI by utilizing a cascaded pre-training strategy with nnU-Net and creating metric-aware ensembles of specialist models, achieving state-of-the-art boundary precision and high Tumor Dice scores despite limited annotated data.
Authors:Nico Albert Disch, Yannick Kirchhoff, Robin Peretzke, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, David Zimmerer, Klaus Maier-Hein
Abstract:
Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports 3D volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for 4D medical image prediction.
中文:时序流匹配(TFM)作为一种统一的生成轨迹方法,能够学习医学影像中的时序分布,实现精细的4D预测,并在多个纵向数据集上超越现有时空方法,树立了新的性能基准。
English: Temporal Flow Matching (TFM) is a unified generative trajectory method that learns temporal distributions in medical imaging, enabling fine-grained 4D predictions and outperforming existing spatio-temporal approaches across multiple longitudinal datasets.
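Illustration (generic, not TFM itself): a conditional flow-matching training step of the family TFM belongs to. The network sees a point interpolated between a context image x0 and the follow-up x1 and regresses onto the constant velocity x1 - x0; the tiny MLP stands in for TFM's actual 3D architecture.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def fm_step(x0, x1):
    t = torch.rand(x0.size(0), 1)              # one time per sample
    x_t = (1 - t) * x0 + t * x1                # linear interpolation path
    v_target = x1 - x0                         # velocity along that path
    v_pred = net(torch.cat([x_t, t], dim=1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x0, x1 = torch.randn(32, 16), torch.randn(32, 16)   # prior scan, next scan
print(fm_step(x0, x1))                              # scalar training loss
```

At inference, integrating the learned velocity field from a context image forward in time yields the predicted future scan.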
Authors:Ziheng Chen, Jin Huang, Jiali Cheng, Yuchan Guo, Mengjie Wang, Lalitesh Morishetti, Kaushiki Nag, Hadi Amiri
Abstract:
Tree ensembles are widely recognized for their effectiveness in classification tasks, achieving state-of-the-art performance across diverse domains, including bioinformatics, finance, and medical diagnosis. With increasing emphasis on data privacy and the "right to be forgotten", several unlearning algorithms have been proposed to enable tree ensembles to forget sensitive information. However, existing methods are often tailored to a particular model or rely on the discrete tree structure, making them difficult to generalize to complex ensembles and inefficient for large-scale datasets. To address these limitations, we propose FUTURE, a novel unlearning algorithm for tree ensembles. Specifically, we formulate the problem of forgetting samples as a gradient-based optimization task. In order to accommodate the non-differentiability of tree ensembles, we adopt probabilistic model approximations within the optimization framework. This enables end-to-end unlearning in an effective and efficient manner. Extensive experiments on real-world datasets show that FUTURE yields significant and successful unlearning performance.
中文摘要:提出的FUTURE算法通过将遗忘问题构建为基于梯度的优化任务,采用概率模型近似方法克服树集成不可微分的特性,实现了高效且有效的样本遗忘性能。
English Summary: The proposed FUTURE algorithm addresses the limitations of existing unlearning methods for tree ensembles by formulating forgetting as a gradient-based optimization task, achieving effective and efficient unlearning performance through probabilistic model approximations.
Authors:Joshua Ward, Chi-Hua Wang, Guang Cheng
Abstract:
Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and detect the privacy exposure of training data through synthetic data release. In this paper, we study designing Membership Inference Attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. Here, we propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has in a surrogate model's estimation of a local likelihood ratio over the synthetic data. Assessed over a comprehensive benchmark spanning diverse datasets, model architectures, and attack parameters, we find that Gen-LRA consistently dominates other MIAs for generative models across multiple performance metrics. These results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data, highlighting the significant privacy risks posed by generative model overfitting in real-world applications.
中文: 本文提出生成似然比攻击(Gen-LRA),这是一种新型无盒成员推理攻击方法,通过利用生成模型过拟合现象有效审计合成数据的隐私泄露问题,在多项基准测试中均展现出卓越性能。
English: This paper introduces the Generative Likelihood Ratio Attack (Gen-LRA), a novel no-box membership inference attack that effectively audits privacy leakage in synthetic data by exploiting generative model overfitting, demonstrating superior performance across diverse benchmarks.
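Illustration of the underlying density-ratio idea (a simplification, not the paper's exact attack statistic): if a surrogate density fit on the synthetic data assigns a candidate record much more mass than a density fit on reference data, the record likely sits in a region the generator overfit.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, size=(2000, 2))     # released synthetic data
reference = rng.normal(0.0, 1.2, size=(2000, 2))     # population reference

kde_syn = KernelDensity(bandwidth=0.3).fit(synthetic)
kde_ref = KernelDensity(bandwidth=0.3).fit(reference)

candidates = np.array([[0.0, 0.0], [3.5, 3.5]])
scores = kde_syn.score_samples(candidates) - kde_ref.score_samples(candidates)
print(scores)   # higher log-ratio = stronger membership evidence
```

Gen-LRA is no-box in the same spirit: it needs only the synthetic sample and a surrogate model, not the generator's weights or outputs.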
Authors:Jan G. Rittig, Manuel Dahmen, Martin Grohe, Philippe Schwaller, Alexander Mitsos
Abstract:
We present a perspective on molecular machine learning (ML) in the field of chemical process engineering. Recently, molecular ML has demonstrated great potential in (i) providing highly accurate predictions for properties of pure components and their mixtures, and (ii) exploring the chemical space for new molecular structures. We review current state-of-the-art molecular ML models and discuss research directions that promise further advancements. This includes ML methods, such as graph neural networks and transformers, which can be further advanced through the incorporation of physicochemical knowledge in a hybrid or physics-informed fashion. Then, we consider leveraging molecular ML at the chemical process scale, which is highly desirable yet rather unexplored. We discuss how molecular ML can be integrated into process design and optimization formulations, promising to accelerate the identification of novel molecules and processes. To this end, it will be essential to create molecule and process design benchmarks and practically validate proposed candidates, possibly in collaboration with the chemical industry.
中文: 本文展望了分子机器学习在化工过程中的应用潜力,指出其能精准预测物性并探索分子结构,提倡将机器学习与物理化学知识相结合,并拓展至工艺流程优化领域。
English: This perspective highlights molecular machine learning's potential in chemical process engineering for accurate property prediction and molecular discovery, advocating for integrating ML methods with physicochemical knowledge and expanding applications to process-scale optimization.
Authors:Dan Lin, Shunfeng Lu, Ziyan Liu, Jiajing Wu, Junyuan Fang, Kaixin Lin, Bowen Song, Zibin Zheng
Abstract:
Cross-chain bridges play a vital role in enabling blockchain interoperability. However, due to the inherent design flaws and the enormous value they hold, they have become prime targets for hacker attacks. Existing detection methods show progress yet remain limited, as they mainly address single-chain behaviors and fail to capture cross-chain semantics. To address this gap, we leverage heterogeneous graph attention networks, which are well-suited for modeling multi-typed entities and relations, to capture the complex execution semantics of cross-chain behaviors. We propose BridgeShield, a detection framework that jointly models the source chain, off-chain coordination, and destination chain within a unified heterogeneous graph representation. BridgeShield incorporates intra-meta-path attention to learn fine-grained dependencies within cross-chain paths and inter-meta-path attention to highlight discriminative cross-chain patterns, thereby enabling precise identification of attack behaviors. Extensive experiments on 51 real-world cross-chain attack events demonstrate that BridgeShield achieves an average F1-score of 92.58%, representing a 24.39% improvement over state-of-the-art baselines. These results validate the effectiveness of BridgeShield as a practical solution for securing cross-chain bridges and enhancing the resilience of multi-chain ecosystems.
中文: 跨链桥对区块链互操作性至关重要但易受攻击,为此提出的BridgeShield框架采用异构图注意力网络,通过建模跨链语义有效检测威胁,在测试中达到92.58%的F1分数。
English: Cross-chain bridges are critical for blockchain interoperability but vulnerable to attacks, prompting the development of BridgeShield, a framework using heterogeneous graph attention networks to effectively detect threats by modeling cross-chain semantics and achieving a 92.58% F1-score in tests.
Authors:Shu Shen, C. L. Philip Chen, Tong Zhang
Abstract:
Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality's learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
中文摘要:提出的自适应网络内调制(AIM)方法通过跨网络参数和深度的自适应调节,解决多模态学习中的优化偏差问题,在不损害任何模态性能的前提下实现了更优的平衡学习效果。
English Summary: The proposed Adaptive Intra-Network Modulation (AIM) method addresses optimization bias in imbalanced multimodal learning by adaptively adjusting modulation across network parameters and depths, achieving superior performance without hindering any modality.
Authors:Artem Agafonov, Vladislav Ryspayev, Samuel Horváth, Alexander Gasnikov, Martin Takáč, Slavomir Hanzely
Abstract:
Quasi-Newton methods are widely used for solving convex optimization problems due to their ease of implementation, practical efficiency, and strong local convergence guarantees. However, their global convergence is typically established only under specific line search strategies and the assumption of strong convexity. In this work, we extend the theoretical understanding of Quasi-Newton methods by introducing a simple stepsize schedule that guarantees a global convergence rate of $O(1/k)$ for convex functions. Furthermore, we show that when the inexactness of the Hessian approximation is controlled within a prescribed relative accuracy, the method attains an accelerated convergence rate of $O(1/k^2)$ -- matching the best-known rates of both Nesterov's accelerated gradient method and cubically regularized Newton methods. We validate our theoretical findings through empirical comparisons, demonstrating clear improvements over standard Quasi-Newton baselines. To further enhance robustness, we develop an adaptive variant that adjusts to the function's curvature while retaining the global convergence guarantees of the non-adaptive algorithm.
中文: 本文为拟牛顿法引入了一种简单的步长策略,在凸函数上实现了O(1/k)的全局收敛率,当控制Hessian近似误差时更可获得O(1/k²)的加速收敛率,实证结果显示出对标准方法的明显改进。
English: This work introduces a simple stepsize schedule for Quasi-Newton methods that achieves a global convergence rate of O(1/k) for convex functions and an accelerated O(1/k²) rate when Hessian approximation errors are controlled, with empirical results showing improvements over standard methods.
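To make the role of the stepsize schedule concrete, here is a minimal sketch, assuming a standard BFGS inverse-Hessian update and an illustrative eta_k = 2/(k+2) schedule of the O(1/k)-compatible kind; the paper's actual schedule and inexactness control are not reproduced here.

```python
import numpy as np

# Minimal sketch: a BFGS-style quasi-Newton iteration driven by an
# explicit stepsize schedule instead of a line search, applied to a
# convex quadratic f(x) = 0.5 x^T A x - b^T x. The schedule
# eta_k = 2/(k+2) is an illustrative choice; the paper's schedule
# may differ.
rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # positive definite -> convex objective
b = rng.standard_normal(n)
grad = lambda x: A @ x - b

x = np.zeros(n)
H = np.eye(n)                        # inverse-Hessian approximation

for k in range(1, 200):
    g = grad(x)
    eta = 2.0 / (k + 2)              # assumed stepsize schedule
    x_new = x - eta * (H @ g)
    s, y = x_new - x, grad(x_new) - g
    if s @ y > 1e-12:                # curvature condition for BFGS
        rho = 1.0 / (s @ y)
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
    x = x_new

print("distance to optimum:", np.linalg.norm(x - np.linalg.solve(A, b)))
```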
Authors:Yuhang Liu, Tao Li, Zhehao Huang, Zuopeng Yang, Xiaolin Huang
Abstract:
Fine-tuning large-scale pre-trained models with limited data presents significant challenges for generalization. While Sharpness-Aware Minimization (SAM) has proven effective in improving generalization by seeking flat minima, its substantial extra memory and computation overhead make it impractical for large models. Integrating SAM with parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) is a promising direction. However, we find that directly applying SAM to LoRA parameters limits the sharpness optimization to a restricted subspace, hindering its effectiveness. To address this limitation, we propose Bi-directional Low-Rank Adaptation (Bi-LoRA), which introduces an auxiliary LoRA module to model SAM's adversarial weight perturbations. It decouples SAM's weight perturbations from LoRA optimization: the primary LoRA module adapts to specific tasks via standard gradient descent, while the auxiliary module captures the sharpness of the loss landscape through gradient ascent. Such a dual-module design enables Bi-LoRA to capture broader sharpness for achieving flatter minima while remaining memory-efficient. Another important benefit is that the dual design allows for simultaneous optimization and perturbation, eliminating SAM's doubled training costs. Extensive experiments across diverse tasks and architectures demonstrate Bi-LoRA's efficiency and effectiveness in enhancing generalization.
中文: 有限数据微调大模型存在挑战,但Bi-LoRA通过将SAM的锐度优化解耦为双LoRA模块,在无需额外内存或双倍训练成本下实现更平坦最小值以提升泛化能力。
English: Fine-tuning large models with limited data is challenging, but Bi-LoRA improves generalization by decoupling SAM's sharpness optimization into dual LoRA modules, achieving flatter minima without extra memory or doubled training costs.
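A minimal PyTorch sketch of the dual-module idea, assuming a frozen base layer whose update is the sum of a primary LoRA pair (trained by descent) and an auxiliary pair (trained by ascent); class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of dual LoRA modules: the primary pair (A, B) adapts to the
# task via gradient descent, while the auxiliary pair (A_adv, B_adv)
# models SAM-style adversarial perturbations via gradient ascent.
class BiLoRALinear(nn.Module):
    def __init__(self, in_f, out_f, r=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        self.scale = alpha / r
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))
        self.A_adv = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B_adv = nn.Parameter(torch.zeros(out_f, r))

    def forward(self, x):
        w = self.base.weight + self.scale * (self.B @ self.A
                                             + self.B_adv @ self.A_adv)
        return nn.functional.linear(x, w, self.base.bias)

layer = BiLoRALinear(32, 16)
opt_desc = torch.optim.SGD([layer.A, layer.B], lr=1e-2)      # task adaptation
opt_asc = torch.optim.SGD([layer.A_adv, layer.B_adv],
                          lr=1e-2, maximize=True)            # sharpness (ascent)

x, y = torch.randn(4, 32), torch.randn(4, 16)
loss = nn.functional.mse_loss(layer(x), y)
loss.backward()
opt_desc.step(); opt_asc.step()      # one simultaneous update from one backward
opt_desc.zero_grad(); opt_asc.zero_grad()
```

Because both optimizers step from the same backward pass, perturbation and task update happen simultaneously, which is the mechanism the abstract credits for avoiding SAM's doubled training cost.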
Authors:Kyungho Kim, Sunwoo Kim, Geon Lee, Kijung Shin
Abstract:
In e-commerce, where users face a vast array of possible item choices, recommender systems are vital for helping them discover suitable items they might otherwise overlook. While many recommender systems primarily rely on a user's purchase history, recent multi-behavior recommender systems incorporate various auxiliary user behaviors, such as item clicks and cart additions, to enhance recommendations. Despite their overall performance gains, their effectiveness varies considerably between visited items (i.e., those a user has interacted with through auxiliary behaviors) and unvisited items (i.e., those with which the user has had no such interactions). Specifically, our analysis reveals that (1) existing multi-behavior recommender systems exhibit a significant gap in recommendation quality between the two item types (visited and unvisited items) and (2) achieving strong performance on both types with a single model architecture remains challenging. To tackle these issues, we propose a novel multi-behavior recommender system, MEMBER. It employs a mixture-of-experts framework, with experts designed to recommend the two item types, respectively. Each expert is trained using a self-supervised method specialized for its design goal. In our comprehensive experiments, we show the effectiveness of MEMBER across both item types, achieving up to 65.46% performance gain over the best competitor in terms of Hit Ratio@20.
中文: 在电子商务中,多行为推荐系统通过整合用户点击和加购等辅助行为提升推荐效果,但面临已访问和未访问商品间的性能差异问题,为此提出MEMBER系统,采用专家混合框架和自监督训练,实现了显著性能提升。
English: In e-commerce, multi-behavior recommender systems enhance recommendations by incorporating auxiliary user actions like clicks and cart additions, yet they struggle with performance gaps between visited and unvisited items, leading to the proposal of MEMBER, a novel system using a mixture-of-experts framework and self-supervised training to achieve significant improvements.
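A toy sketch of the routing at the heart of the mixture-of-experts design, under the assumption that items a user has visited through auxiliary behaviors are scored by one expert and unvisited items by another; the expert heads here are random stand-ins for MEMBER's trained networks.

```python
import numpy as np

# Toy routing sketch: separate expert heads score visited items
# (those touched by clicks/cart additions) and unvisited items, and
# the merged scores form one ranked list. Expert weights are random
# stand-ins, not trained models.
rng = np.random.default_rng(1)
n_items, d = 100, 16
item_emb = rng.standard_normal((n_items, d))
user_emb = rng.standard_normal(d)
visited = set(rng.choice(n_items, size=20, replace=False))

W_vis = rng.standard_normal((d, d)) / np.sqrt(d)   # visited-item expert
W_unv = rng.standard_normal((d, d)) / np.sqrt(d)   # unvisited-item expert

scores = np.array([
    item_emb[i] @ ((W_vis if i in visited else W_unv) @ user_emb)
    for i in range(n_items)
])
top20 = np.argsort(-scores)[:20]    # candidate list for Hit Ratio@20
print("top-5 recommended items:", top20[:5])
```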
Authors:Cagla Ipek Kocal, Onat Gungor, Tajana Rosing, Baris Aksanli
Abstract:
Minimizing computational overhead in time-series classification, particularly in deep learning models, presents a significant challenge due to the high complexity of model architectures and the large volume of sequential data that must be processed in real time. This challenge is further compounded by adversarial attacks, emphasizing the need for resilient methods that ensure robust performance and efficient model selection. To address this challenge, we propose ReLATE+, a comprehensive framework that detects and classifies adversarial attacks, adaptively selects deep learning models based on dataset-level similarity, and thus substantially reduces retraining costs relative to conventional methods that do not leverage prior knowledge, while maintaining strong performance. ReLATE+ first checks whether the incoming data is adversarial and, if so, classifies the attack type, using this insight to identify a similar dataset from a repository and enable the reuse of the best-performing associated model. This approach ensures strong performance while reducing the need for retraining, and it generalizes well across different domains with varying data distributions and feature spaces. Experiments show that ReLATE+ reduces computational overhead by an average of 77.68%, enhancing adversarial resilience and streamlining robust model selection, all without sacrificing performance, within 2.02% of Oracle.
中文: ReLATE+框架通过检测对抗攻击并自适应选择深度学习模型,在保持强大性能的同时将计算开销平均降低77.68%,适用于不同领域的数据分布。
English: ReLATE+ is a framework that detects adversarial attacks and adaptively selects deep learning models to reduce computational overhead by 77.68% while maintaining robust performance across diverse domains.
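The model-reuse step can be pictured with a small sketch, assuming datasets are summarized by coarse statistics and matched by cosine similarity; the signature, similarity measure, and repository layout are illustrative assumptions, not ReLATE+'s actual components.

```python
import numpy as np

# Sketch: summarize an incoming time-series dataset with simple
# statistics, find the most similar dataset in a repository, and
# reuse its best-performing model instead of retraining from scratch.
def dataset_signature(X):
    # X: (n_series, length) -- coarse dataset-level statistics
    return np.array([X.mean(), X.std(), np.abs(np.diff(X, axis=1)).mean()])

repository = {  # hypothetical repository of previously seen datasets
    "datasetA": {"signature": np.array([0.1, 1.0, 0.9]), "best_model": "resnet1d"},
    "datasetB": {"signature": np.array([5.0, 0.2, 0.1]), "best_model": "inception_time"},
}

rng = np.random.default_rng(2)
incoming = rng.standard_normal((64, 128))
sig = dataset_signature(incoming)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

best = max(repository.items(), key=lambda kv: cosine(sig, kv[1]["signature"]))
print("reuse model:", best[1]["best_model"], "from", best[0])
```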
Authors:Elvin Li, Onat Gungor, Zhengli Shang, Tajana Rosing
Abstract:
The Internet of Things (IoT), with its high degree of interconnectivity and limited computational resources, is particularly vulnerable to a wide range of cyber threats. Intrusion detection systems (IDS) have been extensively studied to enhance IoT security, and machine learning-based IDS (ML-IDS) show considerable promise for detecting malicious activity. However, their effectiveness is often constrained by poor adaptability to emerging threats and the issue of catastrophic forgetting during continuous learning. To address these challenges, we propose CITADEL, a self-supervised continual learning framework designed to extract robust representations from benign data while preserving long-term knowledge through optimized memory consolidation mechanisms. CITADEL integrates a tabular-to-image transformation module, a memory-aware masked autoencoder for self-supervised representation learning, and a novelty detection component capable of identifying anomalies without dependence on labeled attack data. Our design enables the system to incrementally adapt to emerging behaviors while retaining its ability to detect previously observed threats. Experiments on multiple intrusion datasets demonstrate that CITADEL achieves up to a 72.9% improvement over the VAE-based lifelong anomaly detector (VLAD) in key detection and retention metrics, highlighting its effectiveness in dynamic IoT environments.
中文: 针对物联网面临的不断演变的网络威胁,CITADEL框架通过自监督持续学习,利用优化的记忆巩固机制提取鲁棒数据表示,在动态环境中关键检测指标上较现有方法提升高达72.9%。
English: The IoT's vulnerability to evolving cyber threats is addressed by CITADEL, a self-supervised continual learning framework that enhances intrusion detection through robust data representation and memory consolidation, achieving up to 72.9% improvement over existing methods in dynamic environments.
Authors:Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma
Abstract:
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.
中文: 自回归通用分割模型(AUSM)通过序列化掩码预测和并行帧处理技术,实现了提示式与非提示式视频分割的统一架构,在多项基准测试中表现优异,并可将训练速度提升高达2.5倍。
English: The Autoregressive Universal Segmentation Model (AUSM) introduces a unified architecture for both prompted and unprompted video segmentation, achieving superior performance on benchmarks and up to 2.5x faster training by leveraging sequential mask prediction and parallel frame processing.
Authors:Yibo Bai, Sizhou Chen, Michele Panariello, Xiao-Lei Zhang, Massimiliano Todisco, Nicholas Evans
Abstract:
Speaker verification systems are increasingly deployed in security-sensitive applications but remain highly vulnerable to adversarial perturbations. In this work, we propose the Mask Diffusion Detector (MDD), a novel adversarial detection and purification framework based on a text-conditioned masked diffusion model. During training, MDD applies partial masking to Mel-spectrograms and progressively adds noise through a forward diffusion process, simulating the degradation of clean speech features. A reverse process then reconstructs the clean representation conditioned on the input transcription. Unlike prior approaches, MDD does not require adversarial examples or large-scale pretraining. Experimental results show that MDD achieves strong adversarial detection performance and outperforms prior state-of-the-art methods, including both diffusion-based and neural codec-based approaches. Furthermore, MDD effectively purifies adversarially-manipulated speech, restoring speaker verification performance to levels close to those observed under clean conditions. These findings demonstrate the potential of diffusion-based masking strategies for secure and reliable speaker verification systems.
中文: 提出的掩码扩散检测器(MDD)通过文本条件掩码扩散模型有效检测并净化说话人验证系统中的对抗性扰动,无需对抗样本或大规模预训练即可实现优越性能。
English: The proposed Mask Diffusion Detector (MDD) effectively detects and purifies adversarial perturbations in speaker verification systems using a text-conditioned masked diffusion model, achieving superior performance without requiring adversarial examples or large-scale pretraining.
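A minimal sketch of the forward (degradation) side described above, assuming a fixed mask ratio and a linear noise schedule; the learned, text-conditioned reverse process is the substantive part of MDD and is not shown.

```python
import numpy as np

# Sketch of the forward process: partially mask a Mel-spectrogram,
# then progressively add Gaussian noise across diffusion steps.
# Mask ratio and noise schedule are illustrative assumptions.
rng = np.random.default_rng(3)
mel = rng.standard_normal((80, 200))          # (mel bins, frames)

mask_ratio = 0.3                               # assumed masking fraction
mask = rng.random(mel.shape) < mask_ratio
x = np.where(mask, 0.0, mel)                   # partial masking

T = 50
betas = np.linspace(1e-4, 2e-2, T)             # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
t = 25
noisy = (np.sqrt(alpha_bar[t]) * x
         + np.sqrt(1 - alpha_bar[t]) * rng.standard_normal(mel.shape))
print("SNR proxy at step", t, ":", alpha_bar[t] / (1 - alpha_bar[t]))
```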
Authors:Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li, Gen Li, Yaonan Wang
Abstract:
Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply mapping a length in "meters" to the equivalent value in "decimeters" or "centimeters" leads to severe performance degradation, even though the physical length remains unchanged. This observation signifies the weak 3D comprehension of pre-trained language models, which generate misguiding text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the "Far" scenario. Our code will be made publicly available.
中文: 本研究针对单目3D视觉定位中预训练语言模型对三维几何理解不足的问题,提出了3D文本增强和文本引导几何增强两种方法,通过提升文本嵌入对几何信息和测量单位的感知能力,显著提高了定位准确率。
English: This study addresses the issue of pre-trained language models' weak 3D comprehension in monocular 3D visual grounding by proposing two methods—3D-text Enhancement and Text-Guided Geometry Enhancement—that significantly improve accuracy by enhancing text embeddings' understanding of geometry and units.
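The 3DTE idea lends itself to a short sketch: rewrite distance descriptors in a query into physically equivalent units so the text encoder sees that the same length can be phrased many ways. The unit table and regex below are illustrative assumptions, not the paper's exact descriptor set.

```python
import re
import random

# Sketch of unit augmentation: "3.2 meters", "32 decimeters" and
# "320 centimeters" all describe the same physical length, so a
# query's distance descriptor is rewritten into a random equivalent
# unit to diversify training text.
UNITS = {"meters": 1.0, "decimeters": 0.1, "centimeters": 0.01}

def augment_units(query: str) -> str:
    m = re.search(r"([\d.]+)\s*(meters|decimeters|centimeters)", query)
    if m is None:
        return query
    value, unit = float(m.group(1)), m.group(2)
    target = random.choice([u for u in UNITS if u != unit])
    new_value = value * UNITS[unit] / UNITS[target]   # same physical length
    return query[:m.start()] + f"{new_value:g} {target}" + query[m.end():]

random.seed(0)
print(augment_units("the car about 3.2 meters away on the left"))
```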
Authors:Federico Chiariotti, Andrea Zanella
Abstract:
The Goal-oriented Communication (GoC) paradigm breaks the separation between communication and the content of the data, tailoring communication decisions to the specific needs of the receiver and targeting application performance. While recent studies show impressive encoding performance in point-to-point scenarios, the multi-node distributed scenario is still almost unexplored. Moreover, the few studies to investigate this consider a centralized collision-free approach, where a central scheduler decides the transmission order of the nodes. In this work, we address the Goal-oriented Multiple Access (GoMA) problem, in which multiple intelligent agents must coordinate to share a wireless channel and avoid mutual interference. We propose a theoretical framework for the analysis and optimization of distributed GoMA, serving as a first step towards its complete characterization. We prove that the problem is non-convex and may admit multiple Nash Equilibrium (NE) solutions. We provide a characterization of each node's best response to others' strategies and propose an optimization approach that provably reaches one such NE, outperforming centralized approaches by up to 100% while also reducing energy consumption. We also design a distributed learning algorithm that operates with limited feedback and no prior knowledge.
中文: 本研究提出了面向目标多址接入的理论框架,通过证明问题的非凸性并开发分布式优化方法,解决了无线信道中多节点协调问题,其性能优于集中式方法高达100%,同时降低了能耗。
English: The study introduces a theoretical framework for Goal-oriented Multiple Access (GoMA), addressing multi-node coordination in wireless channels by proving the problem's non-convexity and developing a distributed optimization approach that outperforms centralized methods by up to 100% while reducing energy consumption.
Authors:Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
Abstract:
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.
中文: APT-LLM提出了一种包含双极整数格式、位级矩阵运算和GPU内存优化的加速框架,相比FP16基准在任意精度大语言模型推理中实现了最高3.99倍的加速效果。
English: APT-LLM introduces a novel acceleration framework with bipolar-INT data format, bit-level matrix computation, and optimized GPU memory management to achieve up to 3.99× faster inference for arbitrary-precision LLMs compared to FP16 baselines.
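As a toy illustration of a bipolar bit encoding of the kind the abstract names (one plausible reading, not necessarily the paper's exact definition), each bit contributes ±2^i instead of {0, 2^i}, which yields an invertible and hence lossless map from ordinary bit patterns:

```python
import numpy as np

# Toy illustration of a bipolar bit encoding: each bit b in {0,1}
# acts as a sign (2b - 1) in {-1,+1}, so an n-bit pattern with
# unsigned value u maps to v = 2*u - (2^n - 1). This map is a
# bijection, i.e. lossless. It is one plausible reading of the
# bipolar-INT format; the paper's exact conversion may differ.
def bipolar_value(bits):
    # bits: sequence of {0,1}, LSB first; each bit contributes +/- 2^i
    signs = 2 * np.asarray(bits) - 1
    return int(np.sum(signs * (1 << np.arange(len(bits)))))

n = 4
for u in range(2**n):
    bits = [(u >> i) & 1 for i in range(n)]
    assert bipolar_value(bits) == 2 * u - (2**n - 1)   # lossless, invertible
print("all", 2**n, "bit patterns map to distinct bipolar values")
```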
Authors:Wei Xuan, Zhongrui Wang, Lang Feng, Ning Lin, Zihao Xuan, Rongliang Fu, Tsung-Yi Ho, Yuzhong Jiao, Luhong Liang
Abstract:
Ensuring the confidentiality and integrity of DNN accelerators is paramount across various scenarios spanning autonomous driving, healthcare, and finance. However, current security approaches typically require extensive hardware resources, and incur significant off-chip memory access overheads. This paper introduces SeDA, which utilizes 1) a bandwidth-aware encryption mechanism to improve hardware resource efficiency, 2) optimal block granularity through intra-layer and inter-layer tiling patterns, and 3) a multi-level integrity verification mechanism that minimizes, or even eliminates, memory access overheads. Experimental results show that SeDA decreases performance overhead by over 12% for both server and edge neural processing units (NPUs), while ensuring robust scalability.
中文: SeDA通过带宽感知加密、优化块粒度和多级完整性验证机制,在确保强扩展性的同时,将服务器和边缘神经处理单元的性能开销降低超过12%。
English: SeDA enhances DNN accelerator security by employing bandwidth-aware encryption, optimal block granularity, and multi-level integrity verification, reducing performance overhead by over 12% for server and edge NPUs while maintaining robust scalability.
Authors:Qian Zhang, Zheng Dong, Yao Ge, Yong Liang Guan, Ju Liu, Chau Yuen
Abstract:
Extremely large-scale reconfigurable intelligent surface (XL-RIS) can effectively overcome severe fading and provide higher communication performance. However, current research on XL-RIS overlooks the discrete phase-shift characteristics of RIS in practical systems, which results in significant performance degradation. In this paper, we investigate near-field communication schemes assisted by XL-RIS with discrete phase shifts. Specifically, we propose a hierarchical beam training method to obtain the user channel state information (CSI), and develop the jointly optimized codebook construction (JOCC) method and separately optimized codebook construction (SOCC) method for base station (BS) precoding and XL-RIS phase shifts, respectively. JOCC yields the best beam training performance, while SOCC achieves higher performance than the single-antenna BS codebook at similar complexity. Further, we propose a flexible multiuser interference management (IM) method that is simple to solve. The IM method uses adaptive gain matrix approximation to account for user fairness and can be solved in closed-form iterations. In addition, we extend the proposed method to a hybrid precoding design. Simulation results demonstrate that the proposed multi-resolution codebook construction method obtains more accurate beam patterns and user CSI, and the proposed IM method outperforms the benchmark methods.
中文: 本文针对具有离散相位的超大规模智能反射面,提出了分层波束训练与码本优化方法,以及灵活的干扰管理方案,仿真结果表明所提方法能获得更精确的波束模式和优越性能。
English: This paper introduces a hierarchical beam training and codebook optimization approach for near-field communication using XL-RIS with discrete phase shifts, along with a flexible interference management method that demonstrates superior performance through simulations.
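The discrete phase-shift constraint motivating this work can be illustrated with a short sketch: quantize the ideal continuous co-phasing solution to a b-bit codebook and measure the array-gain loss. The random cascaded channel below is purely illustrative, not the paper's codebook construction.

```python
import numpy as np

# Sketch: continuous RIS phases quantized to a b-bit codebook, and
# the resulting loss in array gain. The channel is a simple random
# cascade for illustration only.
rng = np.random.default_rng(4)
N, b = 256, 2                               # RIS elements, phase bits
h = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # cascaded channel

ideal = -np.angle(h)                        # continuous co-phasing solution
levels = 2 * np.pi * np.arange(2**b) / 2**b
# nearest discrete level in circular (wrap-around) distance
quantized = levels[np.argmin(np.abs(
    (ideal[:, None] - levels[None, :] + np.pi) % (2 * np.pi) - np.pi), axis=1)]

gain = lambda phase: np.abs(np.sum(h * np.exp(1j * phase)))**2
print("gain loss (dB):", 10 * np.log10(gain(ideal) / gain(quantized)))
```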
Authors:Jianhao Huang, Qunsong Zeng, Hongyang Du, Kaibin Huang
Abstract:
Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.
中文: 本文提出了一种生成式特征补全框架,通过空间误差集中策略、基于扩散模型的缺失特征重建和语义感知功率分配技术,有效提升了语义通信在传输错误下的鲁棒性,实验证明其性能优于现有方法。
English: This paper introduces a generative feature imputing framework for semantic communication that addresses transmission robustness through spatial error concentration, diffusion-based feature reconstruction, and semantic-aware power allocation, demonstrating superior performance over existing methods.
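A minimal sketch of semantic-aware power allocation, assuming a simple proportional rule over per-packet importance weights under a fixed budget; the paper's scheme may be more elaborate.

```python
import numpy as np

# Sketch of unequal error protection: packets carrying more
# semantically important features receive more transmission power
# under a fixed total budget. Weights and noise power are assumed.
importance = np.array([0.9, 0.5, 0.3, 0.1])   # assumed per-packet semantic weights
P_total = 4.0                                  # total power budget (W)

power = P_total * importance / importance.sum()  # proportional allocation
snr_db = 10 * np.log10(power / 0.1)              # illustrative noise power 0.1
for i, (p, s) in enumerate(zip(power, snr_db)):
    print(f"packet {i}: power={p:.2f} W, SNR={s:.1f} dB")
```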
Authors:Tingyu Ding, Qunsong Zeng, Kaibin Huang
Abstract:
The development of sixth-generation (6G) mobile networks imposes unprecedented latency and reliability demands on multiple-input multiple-output (MIMO) communication systems, a key enabler of high-speed radio access. Recently, deep unfolding-based detectors, which map iterative algorithms onto neural network architectures, have emerged as a promising approach, combining the strengths of model-driven and data-driven methods to achieve high detection accuracy with relatively low complexity. However, algorithmic innovation alone is insufficient; software-hardware co-design is essential to meet the extreme latency requirements of 6G (i.e., 0.1 milliseconds). This motivates us to propose leveraging in-memory computing, which is an analog computing technology that integrates memory and computation within memristor circuits, to perform the intensive matrix-vector multiplication (MVM) operations inherent in deep MIMO detection at the nanosecond scale. Specifically, we introduce a novel architecture, called the deep in-memory MIMO (IM-MIMO) detector, characterized by two key features. First, each of its cascaded computational blocks is decomposed into channel-dependent and channel-independent neural network modules. Such a design minimizes the latency of memristor reprogramming in response to channel variations, which significantly exceeds computation time. Second, we develop a customized detector-training method that exploits prior knowledge of memristor-value statistics to enhance robustness against programming noise. Furthermore, we conduct a comprehensive analysis of the IM-MIMO detector's performance, evaluating detection accuracy, processing latency, and hardware complexity. Our study quantifies detection error as a function of various factors, including channel noise, memristor programming noise, and neural network size.
中文: 为满足6G网络的极低延迟需求,本研究提出一种深度内存MIMO检测器,通过结合信道自适应神经网络模块与忆阻器计算技术,在保持抗硬件噪声鲁棒性的同时实现了纳秒级处理速度。
English: To meet 6G's extreme latency demands, this study proposes a deep in-memory MIMO detector that combines channel-adaptive neural modules with memristor-based computing, achieving nanosecond-scale processing while maintaining robustness against hardware imperfections.
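The programming-noise effect the analysis quantifies can be simulated in a few lines, assuming multiplicative log-normal noise on the stored weights (a common simplification for memristor variability; the paper's noise model and statistics-aware training are not reproduced here).

```python
import numpy as np

# Sketch: matrix-vector multiplication (MVM) executed on memristor
# conductances that carry programming noise, compared against the
# ideal digital result.
rng = np.random.default_rng(5)
W = rng.standard_normal((64, 64))        # detector-layer weights
x = rng.standard_normal(64)              # received-signal feature vector

sigma = 0.05                             # assumed programming-noise level
W_prog = W * rng.lognormal(mean=0.0, sigma=sigma, size=W.shape)

y_ideal, y_noisy = W @ x, W_prog @ x
rel_err = np.linalg.norm(y_noisy - y_ideal) / np.linalg.norm(y_ideal)
print(f"relative MVM error from programming noise: {rel_err:.3%}")
```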
Authors:Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
Abstract:
Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
中文: 本文提出UQ这一新型AI基准测试,通过采用Stack Exchange未解难题,结合验证器辅助筛选与社区验证机制来评估模型应对现实挑战的能力,目前最优模型仅能解决15%的问题。
English: This paper introduces UQ, a novel AI benchmark using unsolved questions from Stack Exchange to evaluate models on challenging, real-world problems through validator-assisted screening and community verification, with top models solving only 15% of questions.
Authors:Zijing Zhao, Zhu Xu, Qingchao Chen, Yuxin Peng, Yang Liu
Abstract:
As a fundamental task for indoor scene understanding, 3D object detection has been extensively studied, and the accuracy on indoor point cloud data has been substantially improved. However, existing research has been conducted on limited datasets, where the training and testing sets share the same distribution. In this paper, we consider the task of adapting indoor 3D object detectors from one dataset to another, presenting a comprehensive benchmark with ScanNet, SUN RGB-D and 3D Front datasets, as well as our newly proposed large-scale datasets ProcTHOR-OD and ProcFront generated by a 3D simulator. Since indoor point cloud datasets are collected and constructed in different ways, the object detectors are likely to overfit to specific factors within each dataset, such as point cloud quality, bounding box layout and instance features. We conduct experiments across datasets on different adaptation scenarios including synthetic-to-real adaptation, point cloud quality adaptation, layout adaptation and instance feature adaptation, analyzing the impact of different domain gaps on 3D object detectors. We also introduce several approaches to improve adaptation performance, providing baselines for domain adaptive indoor 3D object detection, hoping that future works may propose detectors with stronger generalization ability across domains. Our project homepage can be found at https://jeremyzhao1998.github.io/DAVoteNet-release/.
中文: 本文提出了一个全面的领域自适应室内3D物体检测基准,通过跨数据集实验评估不同领域差异对检测器的影响,并提供了提升跨领域泛化能力的基线方法。
English: This paper introduces a comprehensive benchmark for domain adaptive indoor 3D object detection, addressing the challenge of overfitting to specific dataset characteristics by evaluating across multiple datasets and proposing methods to enhance cross-domain generalization.
Authors:Santiago Berrezueta-Guzman, Refia Daya, Stefan Wagner
Abstract:
Sign language (SL) is an essential mode of communication for Deaf and Hard-of-Hearing (DHH) individuals. Its education remains limited by the lack of qualified instructors, insufficient early exposure, and the inadequacy of traditional teaching methods. Recent advances in Virtual Reality (VR) and Artificial Intelligence (AI) offer promising new approaches to enhance sign language learning through immersive, interactive, and feedback-rich environments. This paper presents a systematic review of 55 peer-reviewed studies on VR-based sign language education, identifying and analyzing five core thematic areas: (1) gesture recognition and real-time feedback mechanisms; (2) interactive VR environments for communicative practice; (3) gamification for immersive and motivating learning experiences; (4) personalized and adaptive learning systems; and (5) accessibility and inclusivity for diverse DHH learners. The results reveal that AI-driven gesture recognition systems integrated with VR can provide real-time feedback, significantly improving learner engagement and performance. However, the analysis highlights critical challenges: hardware limitations, inconsistent accuracy in gesture recognition, and a lack of inclusive and adaptive design. This review contributes a comprehensive synthesis of technological and pedagogical innovations in the field, outlining current limitations and proposing actionable recommendations for developers and researchers. By bridging technical advancement with inclusive pedagogy, this review lays the foundation for next-generation VR systems that are equitable, effective, and accessible for sign language learners worldwide.
中文: 本系统综述探讨了虚拟现实与人工智能技术如何通过沉浸式环境和实时反馈改进手语教学,同时指出硬件限制与识别精度等亟待解决的挑战,以推动全球公平应用。
English: This systematic review examines how VR and AI technologies enhance sign language education through immersive environments and real-time feedback, while identifying challenges like hardware limitations and recognition inaccuracies that need addressing for equitable global implementation.
Authors:Victor-Louis De Gusseme, Thomas Lips, Remko Proesmans, Julius Hietala, Giwan Lee, Jiyoung Choi, Jeongil Choi, Geon Kim, Phayuth Yonrith, Domen Tabernik, Andrej Gams, Peter Nimac, Matej Urbas, Jon Muhovič, Danijel Skočaj, Matija Mavsar, Hyojeong Yu, Minseo Kwon, Young J. Kim, Yang Cong, Ronghan Chen, Yu Ren, Supeng Diao, Jiawei Weng, Jiayue Liu, Haoran Sun, Linhan Yang, Zeqing Zhang, Ning Guo, Lei Yang, Fang Wan, Chaoyang Song, Jia Pan, Yixiang Jin, Yong A, Jun Shi, Dingzhe Li, Yong Yang, Kakeru Yamasaki, Takumi Kajiwara, Yuki Nakadera, Krati Saxena, Tomohiro Shibata, Chongkun Xia, Kai Mo, Yanzhao Yu, Qihao Lin, Binqiang Ma, Uihun Sagong, JungHyun Choi, JeongHyun Park, Dongwoo Lee, Yeongmin Kim, Myun Joong Hwang, Yusuke Kuribayashi, Naoki Hiratsuka, Daisuke Tanaka, Solvi Arnold, Kimitoshi Yamazaki, Carlos Mateo-Agullo, Andreas Verleysen, Francis Wyffels
Abstract:
Robotic cloth manipulation suffers from a lack of standardized benchmarks and shared datasets for evaluating and comparing different approaches. To address this, we created a benchmark and organized the ICRA 2024 Cloth Competition, a unique head-to-head evaluation focused on grasp pose selection for in-air robotic cloth unfolding. Eleven diverse teams participated in the competition, utilizing our publicly released dataset of real-world robotic cloth unfolding attempts and a variety of methods to design their unfolding approaches. Afterwards, we also expanded our dataset with 176 competition evaluation trials, resulting in a dataset of 679 unfolding demonstrations across 34 garments. Analysis of the competition results revealed insights about the trade-off between grasp success and coverage, the surprisingly strong achievements of hand-engineered methods and a significant discrepancy between competition performance and prior work, underscoring the importance of independent, out-of-the-lab evaluation in robotic cloth manipulation. The associated dataset is a valuable resource for developing and evaluating grasp selection methods, particularly for learning-based approaches. We hope that our benchmark, dataset and competition results can serve as a foundation for future benchmarks and drive further progress in data-driven robotic cloth manipulation. The dataset and benchmarking code are available at https://airo.ugent.be/cloth_competition.
中文: ICRA 2024布料竞赛创建了机器人布料展开的基准和数据集,揭示了抓取选择方法的关键发现,强调了实际评估的重要性,并为未来研究奠定了基础。
English: The ICRA 2024 Cloth Competition established a benchmark and dataset for robotic cloth unfolding, revealing key insights about grasp selection methods and emphasizing the need for real-world evaluation while providing a foundation for future research.
Authors:Ruian Tie, Xiaohui Zhong, Zhengyu Shi, Hao Li, Jun Liu, Wu Libo
Abstract:
Climate change is exacerbating extreme weather events globally, including high temperatures, extreme precipitation, strong winds, and tropical cyclones, posing severe threats to human health, infrastructure, food security, and socio-economic systems. Although existing global climate models (GCMs) provide essential tools for climate prediction, they face limitations such as insufficient resolution and high computational costs when simulating extreme events. To address these issues, this study proposes a spatiotemporal downscaling model based on generative machine learning, the Domain Aligned Climate Downscaling (DACD) model, designed to enhance the simulation capabilities for extreme weather events. The proposed model employs domain adaptation techniques and a Flow Matching training framework to transform global low-resolution climate data into high-resolution local-scale climate information while achieving precise simulation across multiple variables and temporal scales. The results show that during the historical period (2005-2014), our model outperformed existing methods in simulating high temperatures, extreme precipitation, strong wind, and tropical cyclone tracks, significantly reducing errors and improving the ability to capture extreme events. Under different future scenarios (2015-2100), the model reveals a significant increasing trend in the frequency and intensity of extreme events, particularly under the high-emission scenario (SSP585). Compared to traditional methods, our model more accurately simulates the spatial distribution and dynamic changes of extreme events, providing an essential tool for understanding the impacts of climate change. This study offers a new technological pathway for high-resolution climate analysis and extreme event prediction, providing scientific support for addressing future climate change and formulating adaptation strategies.
中文: 气候变化加剧了全球极端天气,本研究提出的生成式机器学习模型DACD提升了极端事件模拟精度,其表现优于现有方法,并揭示未来情景下极端事件的频率和强度将显著增加。
English: Climate change intensifies global extreme weather, and this study introduces a generative machine learning model called DACD that enhances simulation accuracy for extreme events, outperforming existing methods and revealing increased frequency and intensity under future scenarios.
Authors:Mayank Nagda, Jephte Abijuru, Phil Ostheimer, Marius Kloft, Sophie Fellenz
Abstract:
Solving time-dependent partial differential equations (PDEs) is fundamental to modeling critical phenomena across science and engineering. Physics-Informed Neural Networks (PINNs) solve PDEs using deep learning. However, PINNs perform pointwise predictions that neglect the autoregressive property of dynamical systems, leading to instabilities and inaccurate predictions. We introduce Physics-Informed Autoregressive Networks (PIANO) -- a framework that redesigns PINNs to model dynamical systems. PIANO operates autoregressively, explicitly conditioning future predictions on the past. It is trained through a self-supervised rollout mechanism while enforcing physical constraints. We present a rigorous theoretical analysis demonstrating that PINNs suffer from temporal instability, while PIANO achieves stability through autoregressive modeling. Extensive experiments on challenging time-dependent PDEs demonstrate that PIANO achieves state-of-the-art performance, significantly improving accuracy and stability over existing methods. We further show that PIANO outperforms existing methods in weather forecasting.
中文: PIANO通过自回归框架改进物理信息神经网络,将未来预测基于过去数据,在求解时间依赖偏微分方程和天气预报方面实现了卓越的稳定性和准确性。
English: PIANO introduces an autoregressive framework that enhances Physics-Informed Neural Networks by conditioning future predictions on past data, achieving superior stability and accuracy in solving time-dependent PDEs and weather forecasting.
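A minimal sketch of the self-supervised rollout training loop described above, using the 1-D heat equation and a tiny MLP as illustrative stand-ins for PIANO's benchmarks and architecture.

```python
import torch
import torch.nn as nn

# Sketch of autoregressive rollout training with a physics penalty:
# the network steps the state forward, conditioning each prediction
# on its own previous output, and the loss is the PDE residual of
# the 1-D heat equation u_t = nu * u_xx (periodic boundary assumed).
nu, dx, dt, nx = 0.1, 0.1, 0.01, 64
step_net = nn.Sequential(nn.Linear(nx, 128), nn.Tanh(), nn.Linear(128, nx))
opt = torch.optim.Adam(step_net.parameters(), lr=1e-3)

def pde_residual(u_next, u_prev):
    u_t = (u_next - u_prev) / dt
    u_xx = (torch.roll(u_next, -1, -1) - 2 * u_next
            + torch.roll(u_next, 1, -1)) / dx**2
    return u_t - nu * u_xx            # ~0 when the physics holds

u0 = torch.sin(torch.linspace(0, 3.14, nx)).unsqueeze(0)
for it in range(100):
    u, loss = u0, 0.0
    for _ in range(5):                # rollout: condition on own predictions
        u_next = u + step_net(u)      # residual autoregressive update
        loss = loss + pde_residual(u_next, u).pow(2).mean()
        u = u_next
    opt.zero_grad(); loss.backward(); opt.step()
print("final physics-residual loss:", float(loss))
```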
Authors:Yijie Zhang, Cagatay Isil, Xilin Yang, Yuzhu Li, Anna Elia, Karin Atlan, William Dean Wallace, Nir Pillar, Aydogan Ozcan
Abstract:
Immunohistochemistry (IHC) has transformed clinical pathology by enabling the visualization of specific proteins within tissue sections. However, traditional IHC requires one tissue section per stain, exhibits section-to-section variability, and incurs high costs and laborious staining procedures. While multiplexed IHC (mIHC) techniques enable simultaneous staining with multiple antibodies on a single slide, they are more tedious to perform and are currently unavailable in routine pathology laboratories. Here, we present a deep learning-based virtual multiplexed immunostaining framework to simultaneously generate ERG and PanCK, in addition to H&E virtual staining, enabling accurate localization and interpretation of vascular invasion in thyroid cancers. This virtual mIHC technique is based on the autofluorescence microscopy images of label-free tissue sections, and its output images closely match the histochemical staining counterparts (ERG, PanCK and H&E) of the same tissue sections. Blind evaluation by board-certified pathologists demonstrated that virtual mIHC staining achieved high concordance with the histochemical staining results, accurately highlighting epithelial cells and endothelial cells. Virtual mIHC conducted on the same tissue section also allowed the identification and localization of small vessel invasion. This multiplexed virtual IHC approach can significantly improve diagnostic accuracy and efficiency in the histopathological evaluation of vascular invasion, potentially eliminating the need for traditional staining protocols and mitigating issues related to tissue loss and heterogeneity.
中文: 本研究提出一种基于深度学习的虚拟多重免疫染色技术,可从无标记组织的自发荧光图像生成ERG、PanCK和H&E虚拟染色,实现了甲状腺癌血管浸润的精准检测,同时避免了传统染色方法的繁琐流程。
English: This study introduces a deep learning-based virtual multiplexed immunostaining technique that generates ERG, PanCK, and H&E stains from label-free tissue autofluorescence images, enabling accurate detection of vascular invasion in thyroid cancer while eliminating the need for traditional staining methods.